History log of /freebsd-10.0-release/sys/cddl/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
274110 04-Nov-2014 des

[SA-14:24] Fix denial of service attack against sshd(8).
[SA-14:25] Fix kernel stack disclosure in setlogin(2) / getlogin(2).
[SA-14:26] Fix remote command execution in ftp(1).
[EN-14:12] Fix NFSv4 and ZFS cache consistency issue.

Approved by: so (des)

259738 22-Dec-2013 pjd

MFC r259734:

MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0

illumos/illumos-gate@bb411a08b05466bfe0c7095b6373bbc1587e259a

Approved by: re (gjb)

259128 09-Dec-2013 gjb

Remove svn:mergeinfo from the releng/10.0 branch.

After branch creation from stable/10, the stable/10 branch mergeinfo
was moved to the root of the branch.

Since there have not been any merges from stable/10 to releng/10.0
yet, we do not need to track any of the existing mergeinfo here.

Merges to releng/10.0 should now be done to the root of the branch.

For future branches during the release cycle, unless otherwise noted,
this change will be done as part of the stable/ and releng/ branch
creation.

Discussed with: peter
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-10.0-release/MAINTAINERS
/freebsd-10.0-release/Makefile.inc1
/freebsd-10.0-release/ObsoleteFiles.inc
/freebsd-10.0-release/UPDATING
/freebsd-10.0-release/bin/df
/freebsd-10.0-release/bin/freebsd-version
/freebsd-10.0-release/cddl
/freebsd-10.0-release/cddl/contrib/opensolaris
/freebsd-10.0-release/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-10.0-release/cddl/contrib/opensolaris/cmd/zfs
/freebsd-10.0-release/cddl/contrib/opensolaris/lib/libzfs
/freebsd-10.0-release/contrib/apr
/freebsd-10.0-release/contrib/apr-util
/freebsd-10.0-release/contrib/atf
/freebsd-10.0-release/contrib/binutils
/freebsd-10.0-release/contrib/bmake
/freebsd-10.0-release/contrib/byacc
/freebsd-10.0-release/contrib/bzip2
/freebsd-10.0-release/contrib/com_err
/freebsd-10.0-release/contrib/compiler-rt
/freebsd-10.0-release/contrib/dialog
/freebsd-10.0-release/contrib/dtc
/freebsd-10.0-release/contrib/ee
/freebsd-10.0-release/contrib/expat
/freebsd-10.0-release/contrib/file
/freebsd-10.0-release/contrib/gcc
/freebsd-10.0-release/contrib/gdb
/freebsd-10.0-release/contrib/gdtoa
/freebsd-10.0-release/contrib/groff
/freebsd-10.0-release/contrib/ipfilter
/freebsd-10.0-release/contrib/ipfilter/ml_ipl.c
/freebsd-10.0-release/contrib/ipfilter/mlfk_ipl.c
/freebsd-10.0-release/contrib/ipfilter/mlh_rule.c
/freebsd-10.0-release/contrib/ipfilter/mli_ipl.c
/freebsd-10.0-release/contrib/ipfilter/mln_ipl.c
/freebsd-10.0-release/contrib/ipfilter/mls_ipl.c
/freebsd-10.0-release/contrib/ldns
/freebsd-10.0-release/contrib/less
/freebsd-10.0-release/contrib/libarchive
/freebsd-10.0-release/contrib/libarchive/cpio
/freebsd-10.0-release/contrib/libarchive/libarchive
/freebsd-10.0-release/contrib/libarchive/libarchive_fe
/freebsd-10.0-release/contrib/libarchive/tar
/freebsd-10.0-release/contrib/libc++
/freebsd-10.0-release/contrib/libc-vis
/freebsd-10.0-release/contrib/libcxxrt
/freebsd-10.0-release/contrib/libexecinfo
/freebsd-10.0-release/contrib/libpcap
/freebsd-10.0-release/contrib/libstdc++
/freebsd-10.0-release/contrib/llvm
/freebsd-10.0-release/contrib/llvm/tools/clang
/freebsd-10.0-release/contrib/mtree
/freebsd-10.0-release/contrib/ncurses
/freebsd-10.0-release/contrib/netcat
/freebsd-10.0-release/contrib/ntp
/freebsd-10.0-release/contrib/nvi
/freebsd-10.0-release/contrib/one-true-awk
/freebsd-10.0-release/contrib/openbsm
/freebsd-10.0-release/contrib/openpam
/freebsd-10.0-release/contrib/openresolv
/freebsd-10.0-release/contrib/pf
/freebsd-10.0-release/contrib/sendmail
/freebsd-10.0-release/contrib/serf
/freebsd-10.0-release/contrib/smbfs
/freebsd-10.0-release/contrib/subversion
/freebsd-10.0-release/contrib/tcpdump
/freebsd-10.0-release/contrib/tcsh
/freebsd-10.0-release/contrib/tnftp
/freebsd-10.0-release/contrib/top
/freebsd-10.0-release/contrib/top/install-sh
/freebsd-10.0-release/contrib/tzcode/stdtime
/freebsd-10.0-release/contrib/tzcode/zic
/freebsd-10.0-release/contrib/tzdata
/freebsd-10.0-release/contrib/unbound
/freebsd-10.0-release/contrib/wpa
/freebsd-10.0-release/contrib/xz
/freebsd-10.0-release/crypto/heimdal
/freebsd-10.0-release/crypto/openssh
/freebsd-10.0-release/crypto/openssl
/freebsd-10.0-release/etc
/freebsd-10.0-release/etc/rc.d
/freebsd-10.0-release/gnu/lib
/freebsd-10.0-release/gnu/usr.bin/binutils
/freebsd-10.0-release/gnu/usr.bin/cc/cc_tools
/freebsd-10.0-release/gnu/usr.bin/gdb
/freebsd-10.0-release/include
/freebsd-10.0-release/lib
/freebsd-10.0-release/lib/libc
/freebsd-10.0-release/lib/libc/stdtime
/freebsd-10.0-release/lib/libc_nonshared
/freebsd-10.0-release/lib/libfetch
/freebsd-10.0-release/lib/libiconv_modules
/freebsd-10.0-release/lib/libsmb
/freebsd-10.0-release/lib/libthr
/freebsd-10.0-release/lib/libutil
/freebsd-10.0-release/lib/libvmmapi
/freebsd-10.0-release/lib/libyaml
/freebsd-10.0-release/lib/libz
/freebsd-10.0-release/release
/freebsd-10.0-release/release/doc
/freebsd-10.0-release/sbin
/freebsd-10.0-release/sbin/camcontrol
/freebsd-10.0-release/sbin/dumpon
/freebsd-10.0-release/sbin/hastd
/freebsd-10.0-release/sbin/ifconfig
/freebsd-10.0-release/sbin/ipfw
/freebsd-10.0-release/sbin/nvmecontrol
/freebsd-10.0-release/share
/freebsd-10.0-release/share/examples/bhyve
/freebsd-10.0-release/share/i18n/csmapper/JIS
/freebsd-10.0-release/share/i18n/esdb/EUC
/freebsd-10.0-release/share/man
/freebsd-10.0-release/share/man/man4
/freebsd-10.0-release/share/man/man4/bhyve.4
/freebsd-10.0-release/share/man/man5
/freebsd-10.0-release/share/man/man7
/freebsd-10.0-release/share/man/man8
/freebsd-10.0-release/share/misc
/freebsd-10.0-release/share/mk
/freebsd-10.0-release/share/mk/bsd.arch.inc.mk
/freebsd-10.0-release/share/syscons
/freebsd-10.0-release/share/zoneinfo
/freebsd-10.0-release/sys
/freebsd-10.0-release/sys/amd64/include/vmm.h
/freebsd-10.0-release/sys/amd64/include/vmm_dev.h
/freebsd-10.0-release/sys/amd64/include/vmm_instruction_emul.h
/freebsd-10.0-release/sys/amd64/include/xen
/freebsd-10.0-release/sys/amd64/vmm
/freebsd-10.0-release/sys/boot
/freebsd-10.0-release/sys/boot/i386/efi
/freebsd-10.0-release/sys/boot/ia64/efi
/freebsd-10.0-release/sys/boot/ia64/ski
/freebsd-10.0-release/sys/boot/powerpc/boot1.chrp
/freebsd-10.0-release/sys/boot/powerpc/ofw
contrib/opensolaris
/freebsd-10.0-release/sys/conf
/freebsd-10.0-release/sys/contrib/dev/acpica
/freebsd-10.0-release/sys/contrib/dev/acpica/changes.txt
/freebsd-10.0-release/sys/contrib/dev/acpica/common
/freebsd-10.0-release/sys/contrib/dev/acpica/compiler
/freebsd-10.0-release/sys/contrib/dev/acpica/components/debugger
/freebsd-10.0-release/sys/contrib/dev/acpica/components/disassembler
/freebsd-10.0-release/sys/contrib/dev/acpica/components/dispatcher
/freebsd-10.0-release/sys/contrib/dev/acpica/components/events
/freebsd-10.0-release/sys/contrib/dev/acpica/components/executer
/freebsd-10.0-release/sys/contrib/dev/acpica/components/hardware
/freebsd-10.0-release/sys/contrib/dev/acpica/components/namespace
/freebsd-10.0-release/sys/contrib/dev/acpica/components/parser
/freebsd-10.0-release/sys/contrib/dev/acpica/components/resources
/freebsd-10.0-release/sys/contrib/dev/acpica/components/tables
/freebsd-10.0-release/sys/contrib/dev/acpica/components/utilities
/freebsd-10.0-release/sys/contrib/dev/acpica/include
/freebsd-10.0-release/sys/contrib/dev/acpica/os_specific
/freebsd-10.0-release/sys/contrib/ipfilter
/freebsd-10.0-release/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c
/freebsd-10.0-release/sys/contrib/ipfilter/netinet/ip_raudio_pxy.c
/freebsd-10.0-release/sys/contrib/libfdt
/freebsd-10.0-release/sys/contrib/octeon-sdk
/freebsd-10.0-release/sys/contrib/x86emu
/freebsd-10.0-release/sys/dev/bvm
/freebsd-10.0-release/sys/dev/fdt/fdt_ic_if.m
/freebsd-10.0-release/sys/dev/hyperv
/freebsd-10.0-release/sys/modules/hyperv
/freebsd-10.0-release/sys/modules/vmm
/freebsd-10.0-release/sys/x86/include/acpica_machdep.h
/freebsd-10.0-release/tools
/freebsd-10.0-release/tools/build
/freebsd-10.0-release/tools/build/options
/freebsd-10.0-release/tools/tools/atsectl
/freebsd-10.0-release/usr.bin/calendar
/freebsd-10.0-release/usr.bin/csup
/freebsd-10.0-release/usr.bin/iscsictl
/freebsd-10.0-release/usr.bin/procstat
/freebsd-10.0-release/usr.sbin
/freebsd-10.0-release/usr.sbin/bhyve
/freebsd-10.0-release/usr.sbin/bhyvectl
/freebsd-10.0-release/usr.sbin/bhyveload
/freebsd-10.0-release/usr.sbin/bsdconfig
/freebsd-10.0-release/usr.sbin/bsdinstall
/freebsd-10.0-release/usr.sbin/ctladm
/freebsd-10.0-release/usr.sbin/ctld
/freebsd-10.0-release/usr.sbin/freebsd-update
/freebsd-10.0-release/usr.sbin/jail
/freebsd-10.0-release/usr.sbin/mergemaster
/freebsd-10.0-release/usr.sbin/mount_smbfs
/freebsd-10.0-release/usr.sbin/ndiscvt
/freebsd-10.0-release/usr.sbin/pkg
/freebsd-10.0-release/usr.sbin/rtadvctl
/freebsd-10.0-release/usr.sbin/rtadvd
/freebsd-10.0-release/usr.sbin/rtsold
/freebsd-10.0-release/usr.sbin/zic
259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

258595 25-Nov-2013 smh

MFC r258294:
Fix ZFS deadlock when sending a snapshot which is mounted.

Approved by: re (glebius)
Sponsored by: Multiplay


258566 25-Nov-2013 avg

MFV r258378: 4089 NULL pointer dereference in arc_read()

illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea

Tested by: adrian
Approved by: re (gjb)


258565 25-Nov-2013 avg

MFV r258377: 4088 use after free in arc_release()

illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

Approved by: re (gjb)


258563 25-Nov-2013 avg

MFC r258353: zfs page_busy: fix the boundaries of the cleared range

This is a fix for a regression introduced in r246293.

vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them. Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.

It would interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces. Perhaps instead we should clear the whole page
when it is completely overwritten and don't bother clearing any bits
if only a portion a page is written.

Reviewed by: kib
Approved by: re (gjb)


257058 24-Oct-2013 smh

MFC r256889:

Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.

This eliminates L2ARC I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.

Approved by: re (glebius)


256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


256259 10-Oct-2013 avg

MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free()

illumos change 14172:be36a38bac3d:
illumos ZFS issues:
4082 zfs receive gets EFBIG from dmu_tx_hold_free()

Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.

PR: kern/182570
Reported by: Keith White <kwhite@site.uottawa.ca>
Tested by: Keith White <kwhite@site.uottawa.ca>
Approved by: re (glebius)
MFC after: 1 week
X-MFC after: r254753


256148 08-Oct-2013 markj

Initialize and free the DTrace taskqueue in the dtrace module load/unload
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed which the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.

Submitted by: gibbs (original version)
Reviewed by: gibbs
Approved by: re (glebius)
MFC after: 2 weeks


256132 08-Oct-2013 delphij

Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by: Denis Ahrens <denis h3q com>
MFC after: 2 weeks
Approved by: re (gjb)


255753 21-Sep-2013 gibbs

Optimize the block size used on ZFS cache devices as is already done
for data and log devices.

Reported by: Dmitryy Makarov
Submitted by: smh
Reviewed by: gibbs
Approved by: re (delphij)
MFC after: 2 weeks


255750 21-Sep-2013 delphij

MFV r254750:

Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools

MFC after: 1 month
Approved by: re (ZFS blanket)


255748 20-Sep-2013 davide

Fixup cross-device rename checks in ZFS. Add a check for the case
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.

Reported by: avg, pjd
Reviewed by: pjd
Approved by: re (rodrigc)


255437 10-Sep-2013 delphij

MFV r247844 (illumos-gate 13975:ef6409bc370f)

Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by: re (ZFS blanket)


255359 07-Sep-2013 davide

- Use make_dev_credf(MAKEDEV_REF) instead of the race-prone make_dev()+
dev_ref() in the clone handlers that still use it.
- Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs
(for anything real)

Reviewed by: kib


255240 05-Sep-2013 pjd

Handle cases where capability rights are not provided.

Reported by: kib


255226 05-Sep-2013 pjd

Add sysctl/tunables for various metaslab variables.


255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


255161 02-Sep-2013 jhibbits

Whitespace cleanup.


255099 31-Aug-2013 jhibbits

Fixes for DTrace on PowerPC:

- Implement dtrace_getarg()
- Sync fbt with x86, and fix a typo.
- Pull in the time synchronization code from amd64.


254982 28-Aug-2013 delphij

Previously, both zfs_rename and zfs_link does a check on whether
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD). This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.

The recent change (merged as r254585) on upstream changes the
check of v_vfsp to instead check the znode's z_zfsvfs. On Illumos
this would work because when the vnode comes from lofs, the
VOP_REALVP() would give the right vnode, this is not true on
FreeBSD where our VOP_REALVP is a no-op, and as such tdvp is
not guaranteed to be a ZFS vnode, and will later trigger a
failed assertion when verifying the vnode.

This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.

Reported by: many
Diagnostic work by: avg


254813 24-Aug-2013 markj

Rename the kld_unload event handler to kld_unload_try, and add a new
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.

Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifes the linker
code somewhat.

Reviewed by: jhb


254757 24-Aug-2013 delphij

MFV r254749:

Don't hold dd_lock for long by breaking it when not doing dsl_dir
accounting. It is not necessary to hold the lock while manipulating
the parent's accounting, because there is no interface for userland
to see a consistent picture of both parent and child at the same
time anyway.

Illumos ZFS issues:
4046 dsl_dataset_t ds_dir->dd_lock is highly contended


254753 24-Aug-2013 delphij

MFV r254747:

Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive. This is a regression from FreeBSD r253821.

Illumos ZFS issues:
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive


254744 23-Aug-2013 delphij

MFV r254422:

Illumos DTrace issues:
3089 want ::typedef
3094 libctf should support removing a dynamic type
3095 libctf does not validate arrays correctly
3096 libctf does not validate function types correctly


254714 23-Aug-2013 avg

zfs: do not reject any operations on a pool just because it's a boot pool

Unlike the upstream FreeBSD supports booting to all kinds of pools.

Requested by: many
Tested by: sbruno
MFC after: 12 days


254713 23-Aug-2013 avg

fbt: drop a local write-only variable

Discovered with: gcc46
MFC after: 4 days


254711 23-Aug-2013 avg

zfs: inline and remove zfs_vnode_lock

It didn't serve any useful purpose, but obscured file and line information
useful for debugging.

MFC after: 5 days
X-MFC with: r254445


254649 22-Aug-2013 kib

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


254634 22-Aug-2013 jhibbits

Make dtrace_copy() actually work on PowerPC. Although unused currently,
it may be used in the future by dtrace.


254627 21-Aug-2013 ken

Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM

This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.

UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE

This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.

UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE

This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.

UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT

This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.

UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN

This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.

The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.

UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY

This flag means that the file may not written or
appended, but its attributes may be changed.

ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.

The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
Windows name: FILE_ATTRIBUTE_ARCHIVE

The UF_ARCHIVED flag means that the file has changed and
needs to be archived. The meaning is same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.

msdosfs and ZFS have special handling for this flag.
i.e. they will set it when the file changes.

sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.

chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.

ls.1: Reference chflags(1) for a list of file flags
and their meanings.

strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.

chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.

Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.

zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.

ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.

msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.

It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.

After discussion with Bruce Evans, change
several things in the msdosfs behavior:

Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.

Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.

Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.

smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.

This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.

We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.

stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.

The definition of UF_HIDDEN is the same as
the MacOS X definition.

Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).

ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.

These new flags are only stored, UFS does
not take any action if the flag is set.

Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)


254608 21-Aug-2013 gibbs

Add kstat entries for ZFS compression statistics.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
Add module lifetime functions to allocate and teardown
state data.

Report:
- Compression attempts.
- Buffers found to be empty.
- Compression calls that are skipped because
the data length is already less than or
equal to the minimum block length.
- Compression attempts that fail to yield a 12.5%
compression ratio.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
Add calls to the zio_compress.c module's init and fini
functions.

Sponosred by: Spectra Logic Corporation
MFC after: 2 weeks


254591 21-Aug-2013 gibbs

Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.

2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.

Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.

Populate the new fields in the vdev_stat_t structure.

Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.

Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.

In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

For each sub-optimally configured leaf vdev, report configured
and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.

Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.


254587 21-Aug-2013 delphij

MFV r254421:

Illumos ZFS issues:
3996 want a libzfs_core API to rollback to latest snapshot


254585 20-Aug-2013 delphij

MFV r254220:

Illumos ZFS issues:
4039 zfs_rename()/zfs_link() needs stronger test for XDEV


254509 19-Aug-2013 jhibbits

Fix some ppc64 dtrace bugs, and enable systrace_freebsd32 for ppc64.


254468 17-Aug-2013 markj

Add a "translated type" argument to SDT_PROBE_ARGTYPE() and add some macros
which allow one to define SDT probes that specify translated types. The idea
is to make it easy to write SDT probe definitions that can work across
multiple operating systems. In particular, this makes it possible to port
illumos SDT probes to FreeBSD without changing their argument types, so long
as the appropriate translators are defined. Then DTrace scripts written for
Solaris/illumos will work on FreeBSD without any changes.

MFC after: 1 week


254445 17-Aug-2013 pjd

Remove redundant variable.


254309 14-Aug-2013 markj

Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by: jhb
X-MFC with: r254266


254268 13-Aug-2013 markj

FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR: 166927, 166926, 166928
Reported by: davide [1]
Reviewed by: avg, trociny (earlier version)
MFC after: 1 month


254198 11-Aug-2013 rpaulo

fasttrap_fork(): unlock the processes before removing the tracepoints.

In the future, we'll need to come up with new proc_*() functions that accept
locked processes. For now, this prevents postgresql + DTrace from crashing the
system.

MFC after: 1 month


254138 09-Aug-2013 attilio

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


254112 08-Aug-2013 delphij

MFV r254079:

Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting


254077 07-Aug-2013 delphij

MFV r254071:

Fix a regression introduced by fix for Illumos bug #3834. Quote from
Matthew Ahrens on the Illumos issue:

ztest fails this assertion because ztest_dmu_read_write() does
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
dmu_object_set_checksum(os, bigobj,
(enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);

If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object. However, ztest does set_checksum(),
which must modify the dnode. The fix is for ztest to also call

dmu_tx_hold_bonus(tx, bigobj);

so we can account for the dirty data associated with setting the checksum

Illumos ZFS issues:
3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
+ delta <= tx->tx_space_towrite


254074 07-Aug-2013 delphij

MFV r254070:

Merge vendor bugfix for ZFS test suite that triggers false positives.

Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together


254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


254014 06-Aug-2013 delphij

MFV r254013 (dummy merge to note that the change is already
merged):

Illumos ZFS issues:
3973 zfs_ioc_rename alters passed in zc->zc_name


254012 06-Aug-2013 delphij

MFV r254011:

This change have no effect to FreeBSD but integrated for
completeness.

Illumos ZFS issues:
348 ZFS should handle DKIOCGMEDIAINFOEXT failure


253996 06-Aug-2013 avg

opensolaris code: translate INVARIANTS to DEBUG and ZFS_DEBUG

Do this by forcing inclusion of
sys/cddl/compat/opensolaris/sys/debug_compat.h
via -include option into all source files from OpenSolaris.
Note that this -include option must always be after -include opt_global.h.

Additionally, remove forced definition of DEBUG for some modules and fix
their build without DEBUG.

Also, meaning of DEBUG was overloaded to enable WITNESS support for some
OpenSolaris (primarily ZFS) locks. Now this overloading is removed and
that use of DEBUG is replaced with a new option OPENSOLARIS_WITNESS.

MFC after: 17 days


253993 06-Aug-2013 mav

Block reporting of ZFS features for suspended pools.

Before executing any subcommand, zpool tool fetches pools configuration from
the kernel. Before features support was added, kernel was regenerating that
configuration based on data always present in memory. Unfortunately, pool
features list and activity counters are not such. They are stored in ZAP,
that normally resides in ARC, but under heavy memory pressure may be swapped
out. If pool is suspended at this point, there is no way to recover it back
since any zpool command will stuck.

This change has one predictable flaw: `zpool upgrade` always wish to upgrade
suspended pools, but fortunately it can't do it due to the suspension.


253992 06-Aug-2013 mav

Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition. Otherwise it queues free request as
it was before r252840.


253991 06-Aug-2013 mav

Make `zpool clear` to reopen also reconnected cache and spare devices.
Since `zpool status` reports about such kinds of errors, it is strange
that they are not cleared by `zpool clear`.


253990 06-Aug-2013 mav

Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s). That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.

In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.

Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.


253953 05-Aug-2013 attilio

Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by: EMC / Isilon storage division
Reported by: kib


253939 04-Aug-2013 attilio

The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho


253926 04-Aug-2013 smh

zfs_ioc_rename should not leave the value of zc_name passed in via zc altered
on return.

MFC after: 1 week


253821 30-Jul-2013 delphij

MFV r253783:

Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
3834 incremental replication of 'holey' file systems is slow

MFC after: 2 weeks


253820 30-Jul-2013 delphij

MFV r253782:

To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does it's
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).

Illumos ZFS issues:
3888 zfs recv -F should destroy any snapshots created since the
incremental source

MFC after: 2 weeks


253819 30-Jul-2013 delphij

MFV r253781 + r253871:

Illumos ZFS issues:
3894 zfs should not allow snapshot of inconsistent dataset

MFC after: 2 weeks


253816 30-Jul-2013 delphij

MFV r253780:

To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
3875 panic in zfs_root() after failed rollback

MFC after: 2 weeks


253806 30-Jul-2013 mav

Allow three IOCTLs to be used on suspended pool, restoring state that
existed before IOCTL code refactoring merged change 4445fffb from illumos
at r248571.

This change allows `zpool clear` to be used again to recover suspended pool.
It seems the only was supposed by the code to restore pool operation after
reconnecting lost disks that were required for data completeness. There
are still cases where `zpool clear` command can just safely stuck due to
deadlocks inside ZFS kernel part, but probably that is better then having
no chances to recover at all.


253772 29-Jul-2013 avg

dtrace disassembler: take the latest/last CDDL code from OpenSolaris

OpenSolaris version is:
13108:33bb8a0301ab
6762020 Disassembly support for Intel Advanced Vector Extensions (AVX)

This corresponds to Illumos-gate (github) version
ab47273fedff893c8ae22ec39ffc666d4fa6fc8b

MFC after: 3 weeks


253754 28-Jul-2013 mav

Partially close race between calls of orphan() method from GEOM and close()
method from ZFS core, that reliably causes use-after-free panic if SSD vdev
detached during inititial erase.


253643 25-Jul-2013 mav

Following r222950, revert unintentional change cls -> class in argument name
in r245264. Aside from non-uniformity, that again confused C++ compilers.


253606 24-Jul-2013 avg

zfs module: perform cleanup during shutdown in addition to module unload

- move init and fini code into separate functions (like it is done upstream)
- invoke fini code via shutdown_post_sync event hook

This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.

MFC after: 20 days


253603 24-Jul-2013 avg

zfs: move vnode creation from zfs_znode_cache_constructor to zfs_znode_alloc

All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs

This chnage ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops

The change also allows to make zfs_znode_cache_constructor a normal
kmem_cache constructor again (as it is in upstream).
This allows to avoid a problem where zfs_znode_cache_destructor
may be called on un-constructed znodes.

MFC after: 17 days


253441 18-Jul-2013 delphij

Manually merge part of vendor import r238583 from Illumos.

Illumos changeset: 13680:2bd022a765e2
Illumos ZFS issue:

2671 zpool import should not fail if vdev ashift has increased

MFC after: 3 days


253079 09-Jul-2013 avg

dtrace/fasttrap: install hook functions only after all data is
initialized

Sponsored by: HybridCluster
MFC after: 7 days


253073 09-Jul-2013 avg

zfs: try to properly handle i/o errors in mappedread_sf

Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller. The checks are picked up from
kern_sendfile.

MFC after: 3 weeks


253070 09-Jul-2013 avg

zfs: load zpool.cache after a root fs is mounted

MFC after: 3 weeks


252850 05-Jul-2013 markj

Hide references to mod_lock. In FreeBSD it is always acquired with the
provider lock held, so its use has no effect.


252840 05-Jul-2013 mm

MFV r252839:

Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).

By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.

Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case

MFC after: 1 week


252493 01-Jul-2013 markj

Be sure to destory the fasttrap cleanup mutex when unloading the fasttrap
module. This should be MFCed with r250953.


252431 30-Jun-2013 rmh

Enable kernel-specific code for FreeBSD also on other systems that use
the kernel of FreeBSD.

Reviewed by: pjd


252392 29-Jun-2013 smh

style(9) fixes

MFC after: 2 days


252390 29-Jun-2013 smh

Remove invalid ASSERT which causes a panic on zfs renames when run with ASSERTS.
Removal was missed in merge of illumos 3464 (r248571)

MFC after: 2 days


252380 29-Jun-2013 mm

Unbreak "zfs jail" and "zfs unjail" (broken since r248571)

I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's
with the new zfs_ioctl_register_legacy() function.

These operations do not modify pools or datasets so there is no need to
log them to pool history.

Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after: 3 days


252337 28-Jun-2013 gavin

Don't try to re-insert an already present but invalid page.

This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range

This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel

Submitted by: avg
MFC after: 1 week


252325 28-Jun-2013 markj

The dtmalloc provider uses the short description of a malloc type as the
function name of its corresponding DTrace probes. These descriptions may
contain whitespace, but probe names cannot, so just replace any whitespace
with underscores when creating probes.

MFC after: 1 week


252219 25-Jun-2013 delphij

MFV r252215:

Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).

Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl

MFC after: 1 week


252061 21-Jun-2013 smh

Switch ZFS mutex_owner macro to use sx_xholder as its now exported via sx.h

MFC after: 1 week


252060 21-Jun-2013 smh

Fix intermittent ZFS lock panic when kernel is compiled with debugging caused
by access of uninitialized smlock in mmutex_init.

MFC after: 1 week


252056 21-Jun-2013 smh

Fixed import of destroyed ZFS pools failing due to vdev_geom incorrectly
preventing config loads from devices associated with destroyed pools.

Reviewed by: avg
MFC after: 1 week


251646 12-Jun-2013 delphij

MFV r251644:

Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)

Illumos ZFS issues:
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing

MFC after: 2 weeks


251636 11-Jun-2013 delphij

MFV r251626:

ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems

MFC after: 2 weeks


251635 11-Jun-2013 delphij

MFV r251624:

txg commit callbacks don't work

Illumos ZFS issues:
3747 txg commit callbacks don't work

MFC after: 2 weeks


251633 11-Jun-2013 delphij

MFV r251622:

ZFS shouldn't ignore errors unmounting snapshots

Illumos ZFS issues:
3744 zfs shouldn't ignore errors unmounting snapshots

MFC after: 2 weeks


251632 11-Jun-2013 delphij

MFV r251621:

ZFS needs a refcount audit

Illumos ZFS issues:
3741 zfs needs a refcount audit

MFC after: 2 weeks


251631 11-Jun-2013 delphij

MFV r251620:

ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
3741 zfs comments need cleaner, more consistent style

MFC after: 2 weeks


251629 11-Jun-2013 delphij

MFV r251619:

ZFS needs better comments.

Illumos ZFS issues:
3741 zfs needs better comments

MFC after: 2 weeks


251520 08-Jun-2013 delphij

MFV r251519:

* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

ZFS should proactively evict freed blocks from the cache.

Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:

1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

MFC after: 2 weeks


251478 06-Jun-2013 delphij

MFV r251474:

* Illumos zfs issue #3137 L2ARC compression

Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.

The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.

MFC after: 2 weeks


251238 02-Jun-2013 markj

SDT probes can directly pass up to five arguments as arguments to
dtrace_probe(). Arguments beyond these five must be obtained in an
architecture-specific way; this can be done through the getargval provider
method, and through dtrace_getarg() if getargval isn't overridden.

This change fixes two off-by-one bugs in the way these arguments are fetched
in FreeBSD's DTrace implementation. First, the SDT provider must set the
aframes parameter to 1 when creating a probe. The aframes parameter controls
the number of frames that dtrace_getarg() will step over in order to find
the frame containing the extra arguments. On FreeBSD, dtrace_getarg() is
called in SDT probe context via

dtrace_probe()->dtrace_dif_emulate()->dtrace_dif_variable->dtrace_getarg()

so aframes must be 3 since the arguments are in dtrace_probe()'s frame; it
was previously being called with a value of 2 instead. illumos uses a
different aframes value for SDT probes, but this is because illumos SDT
probes fire by triggering the #UD fault handler rather than calling
dtrace_probe() directly.

The second bug has to do with the way arguments are grabbed out
dtrace_probe()'s frame on amd64. The code currently jumps over the first
stack argument and retrieves the rest of them using a pointer into the
stack. This works on i386 because all of dtrace_probe()'s arguments will be
on the stack and the first argument is the probe ID, which should be
ignored. However, it is incorrect to ignore the first stack argument on
amd64, so we correct the pointer used to access the arguments.

MFC after: 2 weeks


251237 02-Jun-2013 markj

Port the SDT test now that it's possible to create SDT probes that take
seven arguments.

The original test uses Solaris' uadmin system call to trigger the test
probe; this change adds a sysctl to the dtrace_test module and gets the test
program to trigger the test probe via the sysctl handler.

The test is currently failing on amd64 because of some bugs in the way that
probe arguments beyond the first five are obtained - these bugs will be
fixed in a separate change.


250953 24-May-2013 markj

The fasttrap provider cleans up probes asynchronously when a process with
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.

This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.

Reviewed by: rpaulo
MFC after: 1 month


250574 12-May-2013 markj

Bring back part of r249367 by adding DTrace's temporal option, which allows
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line

#pragma D option temporal

to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).

This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.

The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.

This change corresponds to part of illumos-gate commit e5803b76927480:
3021 option for time-ordered output from dtrace(1M)

Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month


250149 01-May-2013 davide

In case ZFS doesn't use UMA for buffers there's no need to waste memory
creating zones that will remain empty.

Reviewed by: pjd


249921 26-Apr-2013 smh

Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled
Enabled ZFS TRIM by default

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


249858 24-Apr-2013 mm

MFV r249857:

Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.

Illumos ZFS issues:
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock

MFC after: 1 week


249787 23-Apr-2013 mm

The zfs synctask code restructuring introduced a new bug that makes it
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.

Illumos ZFS issues:
3739 cannot set zfs quota or reservation on pool version < 22

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reported by: Steve Wills <swills@FreeBSD.org>
MFC after: 3 days


249573 17-Apr-2013 pfg

DTrace: Revert r249367

The following change from illumos brought caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.

Reference:
https://www.illumos.org/issues/3026

Reported by: Navdeep Parhar and Mark Johnston


249367 11-Apr-2013 pfg

DTrace: option for time-ordered output

Merge changes from illumos:

3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/

This change bumps the DT_VERS_* number to 1.9.1 in
accordance to what is done in illumos.

This change was somewhat complicated because upstream is mixed many
changes in an individual commit and some of the tests don't really
apply to us.

There are also appear to be differences in timestamping with Solaris
so we had to workaround some assertions making sure no regression
happened.

Special thanks to Fabian Keil for changes and testing.

Illumos Revisions: 13758:23432da34147

Reference:
https://www.illumos.org/issues/3021
https://www.illumos.org/issues/3022
https://www.illumos.org/issues/3023
https://www.illumos.org/issues/3024
https://www.illumos.org/issues/3025
https://www.illumos.org/issues/1694

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 months


249356 11-Apr-2013 mm

MFV r249354:
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.

Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream

MFC after: 8 days


249326 10-Apr-2013 mm

Cast to (void *)(uintptr_t) on copyout and copyin of zfs_iocparm_t.zfs_cmd

MFC after: 9 days


249319 09-Apr-2013 mm

ZFS expects a copyout of zfs_cmd_t on an ioctl error. Our sys_ioctl()
doesn't copyout in this case.

To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)

The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
what makes porting of new changes easier and ensures correct behavior if
returning an error.

MFC after: 10 days


249209 06-Apr-2013 mm

MFV r249186:
Do not list read-only pools in zpool.cache
Reduce diff against vendor in unused vdev_disk.c

Illumos ZFS issues:
3639 zpool.cache should skip over readonly pools
3640 want automatic devid updates

MFC after: 1 week


249206 06-Apr-2013 mm

MFV r248660:
Merge vendor change - modify time processing in deadman thread.

Illumos ZFS issues:
3618 ::zio dcmd does not show timestamp data

MFC after: 3 weeks


249196 06-Apr-2013 mm

Provide a fix for kernel panic if receiving recursive deduplicated streams.
Problem reported to vendor.

Illumos ZFS issues:
3692 Panic on zfs receive of a recursive deduplicated stream

MFC after: 2 weeks


249195 06-Apr-2013 mm

MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
3598 want to dtrace when errors are generated in zfs

MFC after: 3 weeks


249188 06-Apr-2013 mm

MFV r242816:
Import vendor change to reduce diff, no effect on FreeBSD.

Illumos ZFS issues:
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd


249047 03-Apr-2013 avg

spa_open_common: fix argument to zvol_create_minors

Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.

MFC after: 7 days


249042 03-Apr-2013 mm

Fix possible pool hold leak in dmu_send_impl()

Problem reported to vendor:
https://www.illumos.org/issues/3645

Reported by: Andriy Gapon <avg@FreeBSD.org>
MFC after: 15 days


249004 02-Apr-2013 mm

Do not check against uninitialized rc and comment out vendor code

MFC after: 16 days


248983 01-Apr-2013 pfg

Dtrace: enablings on defunct providers prevent providers from unregistering

Merge change from illumos:

1368 enablings on defunct providers prevent providers from unregistering

We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.

Illumos Revision: r13430:8e6add739e38

Reference:
https://www.illumos.org/issues/1368

Reviewed by: gnn
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 3 weeks


248976 01-Apr-2013 mm

Call dmu_snapshot_list_next() in zvol.c with dsl_pool_config lock held

Submitted by: Andriy Gapon <avg@FreeBSD.org>
MFC after: 17 days


248752 26-Mar-2013 pfg

Dtrace: dtrace.c erroneously checks for memory alignment on amd64.

Merge change from illumos:

3511 dtrace.c erroneously checks for memory alignment on amd64

Illumos Revision: c93cc65

Reference:
https://www.illumos.org/issues/3511

Obtained from: Illumos
MFC after: 3 weeks


248708 25-Mar-2013 pfg

Dtrace: Add SUN MDB-like type-aware print() action.

Merge change from illumos:

1694 Add type-aware print() action

This is a very nice feature implemented in upstream Dtrace.
A complete description is available here:
http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/

This change bumps the DT_VERS_* number to 1.9.0 in
accordance to what is done in illumos.

While here also include some minor cleanups to ease further merging
and appease clang with a fix by Fabian Keil.

Illumos Revisions: 13501:c3a7090dbc16
13483:f413e6c5d297

Reference:
https://www.illumos.org/issues/1560
https://www.illumos.org/issues/1694

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248706 25-Mar-2013 pfg

Dtrace: add toupper()/tolower() and enhancements to lltostr().

Merge changes from illumos:

1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base

This change bumps the DT_VERS_* number to 1.8.1 in
accordance to what is done in illumos.

The test suite we currently include is outdated and
doesnt support some updates in tst.subr.d which had to
be left out for now.

Illumos Revisions: r13458 5e394d8db762
r13459 c3454574dd1a

Reference:
https://www.illumos.org/issues/1451
https://www.illumos.org/issues/1457

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248690 24-Mar-2013 pfg

Dtrace: add optional size argument to tracemem().

Merge change from illumos:

1455 DTrace tracemem() should take an optional size argument

Our local enhancements to dt_print_bytes were equivalent to
those in illumos but we made it match the illumos version
to ease further code merges.

For now leave out tst.smallsize.d and tst.smallsize.d.out
since those don't seem to work cleanly on FreeBSD.

This change bumps the DT_VERS_* number to 1.7.1 in accordance
to what is done in illumos.

Illumos Revision: 13457:571b0355c2e3

Reference:
https://www.illumos.org/issues/1455

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248653 23-Mar-2013 will

ZFS: Fix a panic while unmounting a busy filesystem.

This particular scenario was easily reproduced using a NFS export. When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed. The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.

Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes. Simply #ifdef sun this check.

Submitted by: avg
Reviewed by: avg
Approved by: ken (mentor)
MFC after: 1 month


248642 23-Mar-2013 avg

fbt_getargdesc: correctly handle types for return probes

MFC after: 6 days


248640 23-Mar-2013 avg

fbt_typoff_init: fix an off by one in determining required memory size

This issue would be silent most of the time, but if the requested memory
is a multiple of a page size, then accessing one element beyond the end
would lead to a kernel page fault.
Otherwise, the unlucky last type would just be inaccessible.

Reported by: glebius
Tested by: glebius
MFC after: 6 days


248602 21-Mar-2013 smh

Fix for building libzpool under i386.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248579 21-Mar-2013 smh

Add missing descriptions for ZFS sysctls

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248577 21-Mar-2013 smh

Optimisation of TRIM processing.

Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSD's are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIM's where processed.

In addition due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.

This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)

Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248576 21-Mar-2013 smh

Names the ZFS TRIM thread

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248575 21-Mar-2013 smh

TRIM cache devices based on time instead of TXGs.
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.

Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.

This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 second.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/17122c31ac7f82875e837019205c21651c05f8cd
MFC after: 2 weeks


248574 21-Mar-2013 smh

Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

- Free ZIOs are registered with the TXG from the ZIO itself, not the
current SPA syncing TXG (which may be out of date);
- L2ARC are registered with a zero TXG number, as L2ARC has no concept
of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced
TXG, not the currently syncing TXG. Indeed, under extremely unlikely
race conditions, there is a risk we could trim blocks which have been
freed in a TXG that has not finished syncing, resulting in potential
data corruption in case of a crash.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/5b46ad40d9081d75505d6f3bf04ac652445df366
MFC after: 2 weeks


248573 21-Mar-2013 smh

Don't register repair writes in the trim map.

The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: Source: https://github.com/dechamps/zfs/commit/6a3cebaf7c5fcc92007280b5d403c15d0e61dfe3


248572 21-Mar-2013 smh

Add TRIM support for L2ARC.

This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc
MFC after: 2 weeks


248571 21-Mar-2013 mm

Merge libzfs_core branch:
includes MFV 238590, 238592, 247580

MFV 238590, 238592:
In the first zfs ioctl restructuring phase, the libzfs_core library was
introduced. It is a new thin library that wraps around kernel ioctl's.
The idea is to provide a forward-compatible way of dealing with new
features. Arguments are passed in nvlists and not random zfs_cmd fields,
new-style ioctls are logged to pool history using a new method of
history logging.

http://blog.delphix.com/matt/2012/01/17/the-future-of-libzfs/

MFV 247580 [1]:
To address issues of several deadlocks and race conditions the locking
code around dsl_dataset was rewritten and the interface to synctasks
was changed.

User-Visible Changes:
"zfs snapshot" can create more arbitrary snapshots at once (atomically)
"zfs destroy" destroys multiple snapshots at once
"zfs recv" has improved performance

Backward Compatibility:
I have extended the compatibility layer to support full backward
compatibility by remapping or rewriting the responsible ioctl arguments.
Old utilities are fully supported by the new kernel module.

Forward Compatibility:
New utilities work with old kernels with the following restrictions:
- creating, destroying, holding and releasing of multiple snapshots
at once is not supported, this includes recursive (-r) commands

Illumos ZFS issues:
2882 implement libzfs_core
2900 "zfs snapshot" should be able to create multiple,
arbitrary snapshots at once
3464 zfs synctask code needs restructuring

References:
https://www.illumos.org/issues/2882
https://www.illumos.org/issues/2900
https://www.illumos.org/issues/3464 [1]

MFC after: 1 month
Sponsored by: Hybrid Logic Inc. [1]


248493 19-Mar-2013 mm

Plug memory leak in dsl_check_snap_cb()
This was unnoticed because the function is very rarely used.

MFC after: 3 days


248470 18-Mar-2013 jhb

Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.


248457 18-Mar-2013 jhibbits

Add FBT for PowerPC DTrace. Also, clean up the DTrace assembly code,
much of which is not necessary for PowerPC.

The FBT module can likely be factored into 3 separate files: common,
intel, and powerpc, rather than duplicating most of the code between
the x86 and PowerPC flavors.

All DTrace modules for PowerPC will be MFC'd together once Fasttrap is
completed.


248426 17-Mar-2013 mm

Fix typo in sysctl description

Reported by: Jeremy Chadwick
MFC after: 3 days


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


248082 09-Mar-2013 attilio

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


247852 05-Mar-2013 mm

MFV r247845:
Import ZFS bpobj bugfix from vendor.

Illumos ZFS issues:
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely

References:
https://www.illumos.org/issues/3603
https://www.illumos.org/issues/3604

MFC after: 1 week


247820 04-Mar-2013 gibbs

Fix assertion failure when using userland DTrace probes from
the pid provider on a kernel compiled with INVARIANTS.

sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
In fasttrap_probe_pid(), attempts to write to the
address space of the thread that fired the probe
must be performed with the process of the thread
held. Use _PHOLD() to ensure this is the case.

In fasttrap_probe_pid(), use proc_write_regs() instead
of calling set_regs() directly. proc_write_regs()
performs invariant checks to verify the calling
environment of set_regs(). PROC_LOCK()/UNLOCK() around
the call to proc_write_regs() so that it's invariants
are satisfied.

Sponsored by: Spectra Logic Corporation
Reviewed by: gnn, rpaulo
MFC after: 1 week


247684 03-Mar-2013 rpaulo

Remove the extra parenthesis from the cv_init() macro. They are not
necessary because we already use parenthesis in zfs_cv_init().

This fixes a long standing bug where there would be an extra ")" at the
end of the string. This extra parenthesis would show up in the WCHAN of
the process (top, stty status, etc.).


247602 02-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


247592 01-Mar-2013 delphij

MFV r247575:

Import a fix tighten assertion on SPA versions from vendor (Illumos).

Illumos ZFS issue:

3543 Feature flags causes assertion in spa.c to miss certain cases

MFC after: 2 weeks


247585 01-Mar-2013 mm

MFV r247316:
Merge new read-only zfs properties from vendor (illumos)

Illumos ZFS issues:
3588 provide zfs properties for logical (uncompressed) space used and
referenced

References:
https://www.illumos.org/issues/3588

MFC after: 2 weeks


247540 01-Mar-2013 mm

Fix the zfs_ioctl compat layer to support zfs_cmd size change introduced
in r247265 (ZFS deadman thread). Both new utilities now support the old
kernel and new kernel properly detects old utilities.

For future backwards compatibility, the vfs.zfs.version.ioctl read-only
sysctl has been introduced. With this sysctl zfs utilities will be able
to detect the ioctl interface version of the currently loaded zfs module.

As a side effect, the zfs utilities between r247265 and this revision don't
support the old kernel module. If you are using HEAD newer or equal than
r247265, install the new kernel module (or whole kernel) first.

MFC after: 10 days


247398 27-Feb-2013 mm

MFV 247176, 247178, 247315:
Import metaslab_sync() speedup from vendor (illumos).

Illumos ZFS issues:
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not
condensing)
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()

References:
https://www.illumos.org/issues/3552
https://www.illumos.org/issues/3564
https://www.illumos.org/issues/3578
https://www.illumos.org/issues/3579

MFC after: 2 weeks


247348 26-Feb-2013 mm

Be more verbose on ZFS deadman I/O panic
Patch suggested upstream.

Suggested by: Olivier Cinquin
MFC after: 12 days


247265 25-Feb-2013 mm

MFV v242732:

Merge the ZFS I/O deadman thread from vendor (illumos).
This feature panics the system on hanging ZFS I/O, helps debugging
and resumes failed service.

The panic behavior can be controlled with the loader-only tunables:
vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O)
vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)

By default, ZFS I/O deadman is enabled by default on amd64 and i386
excluding virtual guest machines.

Illumos ZFS issues:
3246 ZFS I/O deadman thread

References:
https://www.illumos.org/issues/3246

MFC after: 2 weeks


247187 23-Feb-2013 mm

MFV r246653:
Import vendor change to avoid "unitialized variable" warnings.

Illumos ZFS issues:
3522 zfs module should not allow uninitialized variables

References:
https://www.illumos.org/issues/3522


247049 20-Feb-2013 gibbs

Avoid panic when tearing down the DTrace pid provider for a
process that has crashed.

sys/cddl/contrib/opensolaris/uts/common/dtrace/fasttrap.c:
In fasttrap_pid_disable(), we cannot PHOLD the proc
structure for a process that no longer exists, but
we still have other, fasttrap specific, state that
must be cleaned up for probes that existed in the
dead process. Instead of returning early if the
process related to our probes isn't found,
conditionalize the locking and carry on with a NULL
proc pointer. The rest of the fasttrap code already
understands that a NULL proc is possible and does
the right things in this case.

Sponsored by: Spectra Logic Corporation
Reviewed by: rpaulo, gnn
MFC after: 1 week


246808 14-Feb-2013 delphij

Eliminate real_LZ4_uncompress. It's unused and does not perform sufficient
check against input stream (i.e. it could read beyond specified input
buffer).


246773 13-Feb-2013 mm

Change vfs.zfs.write_to_degraded from CTLFLAG_RW to CTLFLAG_RWTUN

Suggested by: pjd


246768 13-Feb-2013 delphij

Restore De Bruijn algorithm for sparc64 where the compiler rely on a
library function for __builtin_c?z.

Tested by: Michael Moll <kvedulv kvedulv de>


246688 11-Feb-2013 mm

Merge zfs_ioctl.c code that should have been merged together with ZFS v28.
Fixes several problems if working with read-only pools.

Changed code originaly introduced in onnv-gate 13061:bda0decf867b
Contains changes up to illumos-gate 13700:4bc0783f6064

PR: kern/175897
Suggested by: avg

MFC after: 2 weeks


246678 11-Feb-2013 mm

MFV r246633:
Import vendor bugfixes regarding SA rounding, header size and layout.
This was already partially fixed by avg.

Illumos ZFS issues:
3512 rounding discrepancy in sa_find_sizes()
3513 mismatch between SA header size and layout

References:
https://www.illumos.org/issues/3512
https://www.illumos.org/issues/3513

MFC after: 2 weeks


246675 11-Feb-2013 mm

MFV r246394:
Add tunable to allow block allocation on degraded vdevs.

Illumos ZFS issues:
3507 Tunable to allow block allocation even on degraded vdevs

References:
https://www.illumos.org/issues/3507

MFC after: 2 weeks


246666 11-Feb-2013 mm

MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
https://www.illumos.org/issues/3498

MFC after: 2 weeks


246651 11-Feb-2013 mm

MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after: 2 weeks


246631 10-Feb-2013 mm

MFV r246388:

Import vendor bugfixes

Illumos ZFS issues:
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

References:
https://www.illumos.org/issues/3422
https://www.illumos.org/issues/3425

MFC after: 2 weeks


246586 09-Feb-2013 delphij

MFV r245512:

* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible. This means once a pool have LZ4 compressed data, these
data can no longer be read by older ZFS implementations.

Local changes:

- On-stack hash table disabled and using kernel slab allocator
instead, at this time. This requires larger kernel thread stack
for zio workers. This may change in the future should we adjusted
the zio workers' thread stack size.
- likely and unlikely will be undefined if they are already defined,
this is required for i386 XEN build.
- Removed De Bruijn sequence based __builtin_ctz family of builtins
in favor of the latter. Both GCC and clang supports these builtins.
- Changed the way the LZ4 code detects endianness.
- Manual pages modifications to mention the feature based on Illumos
counterpart.
- Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from: Illumos (13921:9d721847e469)
Tested on: FreeBSD/amd64
MFC after: 1 month


246538 08-Feb-2013 pluknet

Fix warning: comparison of unsigned expression < 0 is always false.

Reported by: clang


246532 08-Feb-2013 avg

zfs_vget, zfs_fhtovp: properly handle the z_shares_dir object

A special gfs vnode corresponds to that object.
A regular zfs vnode must not be returned.

This should be upstreamed.

Reported by: pluknet
Submitted by: rmacklem
Tested by: pluknet
MFC after: 10 days


246531 08-Feb-2013 avg

zfs: update comments about zfid_long_t to match the FreeBSD definitions

MFC after: 1 week


246293 03-Feb-2013 avg

zfs: fix, improve and re-organize page_lookup and page_unlock

Now they are split into two pairs: page_hold/page_unhold for mappedread
and page_busy/page_unbusy for update_pages.

For mappedread we simply hold a page that is to be used as a source if it
is resident and valid (and not busy). This is sufficient since we are
only doing page -> user buffer copying. There is no page <-> backing
storage I/O involved.

update_pages is now better split to properly handle the putpages case
(page -> arc) and the regular write case (arc -> page).

For the latter we use complete protocol of marking an object with
paging-in-progress and marking a page with io_start (busy count).
Also, in this case we remove the write bit from all page mappings and
clear dirty bits of the pages, the former is needed to ensure that the
latter does the right thing.
Additionally we update a page if it is cached instead of just freeing it
as was done before. This needs to be verified.

A minor detail: ZFS-backed pages should always be either fully valid
or fully invalid. Assert this and use simpler API that does not deal
with sub-page blocks.

Reviewed by: kib
MFC after: 26 days


246275 03-Feb-2013 jhibbits

Fix the PowerPC DTrace copy functions. The kernel doesn't hold the same view to
the user map, so use the md copy in/out functions provided by the kernel.

MFC with: r242723


246244 02-Feb-2013 avg

solaris compat: remove KM_ZERO

- there is no such flag in Solaris and derivatives
- the flag was added in an unrelated change
- the flag is not used

The proper way to allocate zeroed out memory is to use kmem_zalloc.

MFC after: 3 days


246242 02-Feb-2013 avg

zfs: add MODULE_VERSION for zfsctrl

This should allow the kernel linker to easily detect a situation
when the module is present both in a kernel and in a preloaded file
(zfs.ko).

Reviewed by: jhb
MFC after: 5 days


245945 26-Jan-2013 avg

spa_generate_rootconf: add support for old vdev labels

It seems that old ZFS versions (v15) completely omit "vdev_children"
property when there is a single child.

Reported by: jase
Tested by: jase
MFC after: 1 week


245511 16-Jan-2013 delphij

MFV r245510:

improve the comment in txg.c

Obtained from: Illumos (13910:f3454e0a097c)
MFC after: 2 weeks


245409 14-Jan-2013 kib

For zfs vnodes, use the standard inode number based hash algorithm.

Reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


245264 10-Jan-2013 delphij

The current ZFS code expects ddt_zap_count to always succeed by asserting
the underlying zap_count() to return no errors. However, it is possible
that the pool reaches to such a state where zap_count would return error,
leading to panics when a pool is imported.

This commit changes the ddt_zap_count to return error returned from
zap_count and handle the error appropriately. With this change, it's now
possible to let zpool rollback damaged transaction groups and import the
pool.

Obtained from: ZFS on Linux github (e8fd45a0f975c6b8ae8cd644714fc21f14fac2bf)
MFC after: 1 month


244635 23-Dec-2012 avg

zfs: solaris doesn't have KM_ZERO, kmem_zalloc should be used instead

To do: remove KM_ZERO declaration
Pointyhat to: avg (for mindlessly using the pseudo-flag)
MFC after: instantly (to fix stable/8 build)


244631 23-Dec-2012 rstone

Correct a series of errors in the hand-rolled locking for drace_debug.c:

- Use spinlock_enter()/spinlock_exit() to prevent a thread holding a
debug lock from being preempted to prevent other threads waiting
on that lock from starvation.

- Handle the possibility of CPU migration in between the fetch of curcpu
and the call to spinlock_enter() by saving curcpu in a local variable.

- Use memory barriers to prevent reordering of loads and stores of the
data protected by the lock outside of the critical section

- Eliminate false sharing of the locks by moving them into the structures
that they protect and aligning them to a cacheline boundary.

- Record the owning thread in the lock to make debugging future problems
easier.

Reviewed by: rpaulo (initial version)
MFC after: 2 weeks


244188 13-Dec-2012 smh

Added vfs.zfs.vdev.trim_on_init sysctl which allows full vdev trim on
initialisation to be enabled (1) / disabled (0) defaults to enabled.

This is useful for devices which have a slow trim speed and are either
new or have otherwise already been wiped e.g. secure erase.

PR: kern/173116
Submitted by: Steven Hartland
Approved by: pjd (mentor)


244187 13-Dec-2012 smh

Upgrades trim free request sizes before inserting them into to free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR: kern/173254
Submitted by: Steven Hartland
Approved by: pjd (mentor)


244155 12-Dec-2012 smh

Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR: kern/173113
Submitted by: Steven Hartland
Reviewed by: pjd
Approved by: pjd
MFC after: 2 weeks


243807 03-Dec-2012 delphij

Use SA_ZPL_CRTIME instead of SA_ZPL_CTIME for creation time.

Submitted by: phil.stone at gmx.com
MFC after: 2 weeks


243763 01-Dec-2012 avg

zfs_getpages: make use of vm_page_readahead_finish

Suggested by: kib
MFC after: 5 days


243762 01-Dec-2012 avg

gfs_file_inactive: replace bad code with ugly code

Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR: kern/151111
Reviewed by: kib
MFC after: 13 days


243560 26-Nov-2012 mm

MFV r243395:

Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after: 2 weeks


243525 25-Nov-2012 mm

Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after: 2 weeks


243524 25-Nov-2012 mm

MFV r243013 and r243267:

Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after: 2 weeks


243521 25-Nov-2012 avg

zfs_freebsd_reclaim: remove a stray variable

... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after: 5 weeks
X-MFC with: r243520


243520 25-Nov-2012 avg

zfs: overhaul zfs-vfs glue for vnode life-cycle management

* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model. All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit. This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim. This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start. This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested. Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 6 weeks


243519 25-Nov-2012 avg

zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here

MFC after: 12 days


243518 25-Nov-2012 avg

add zfs_bmap to aid vnode_pager_haspage

... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block. Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk. The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed. vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem. So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster code vfs_bio code should not be called for ZFS, because ZFS does
not use buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole, it has its prviate mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by: kib
MFC after: 6 weeks


243517 25-Nov-2012 avg

zfs_getpages: optimize for large block sizes

MFC after: 6 weeks


243505 25-Nov-2012 mm

MFV r243012:

Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version

References:
https://www.illumos.org/issues/3349

MFC after: 1 week


243503 25-Nov-2012 mm

MFV r242735:

Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after: 2 weeks


243502 24-Nov-2012 avg

zfs roopool: add support for multi-vdev configurations

Tested by: madpilot
MFC after: 10 days


243501 24-Nov-2012 avg

spa_import_rootpool: initialize ub_version before calling spa_config_parse

... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by: Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by: Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after: 6 days


243500 24-Nov-2012 avg

spa_import_rootpool: do not call spa_history_log_version

The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after: 6 days


243498 24-Nov-2012 avg

opensolaris compat: terminate cmn_err mesages with a new line

MFC after: 6 days


243497 24-Nov-2012 avg

zfs: create devices/geoms from zvols after receiveing them

PR: kern/167066
Tested by: Andreas Nilsson <andrnils@gmail.com>
MFC after: 13 days


243270 19-Nov-2012 avg

zfs_remove: assert that delete_now case is never true on FreeBSD

That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with: pjd
MFC after: 12 days


243268 19-Nov-2012 avg

zfs_remove: set VV_NOSYNC flag if a node is unlinked

Suggested by: kib
MFC after: 12 days


243213 18-Nov-2012 avg

spa_import_rootpool: fall back to use configuration from zpool.cache...

if we fail to generate a proper root pool configuration based on disk
probing. Currently we can not properly generate the configuration for
multi-vdev pools. Make that explicit.

Reported by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after: 4 days


242958 13-Nov-2012 kib

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


242862 10-Nov-2012 avg

zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots

... before trying to destroy the zvol snapshots themselves.

PR: kern/173442
Reported by: Petri Helenius <petri@helenius.fi>,
mm
Obtained from: Brian Behlendorf <behlendorf1@llnl.gov>,
Illumos Bug #3170
Tested by: Petri Helenius <petri@helenius.fi>
MFC after: 10 days


242845 10-Nov-2012 delphij

MFV r242729 (mm):

Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable


242833 09-Nov-2012 attilio

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


242723 07-Nov-2012 jhibbits

Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.

There is one known issue: Some probes will display an error message along the
lines of: "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit. I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded. Volunteers are welcome.

MFC after: 1 month


242575 04-Nov-2012 avg

zfs_dirlook: bailout early if directory is unlinked

Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with: pjd
Also see: http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after: 26 days


242574 04-Nov-2012 avg

zfsctl_snapdir_lookup: obtain a snapname in the remount case

... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
MFC after: 20 days


242573 04-Nov-2012 avg

zfs: set MNTK_EXTENDED_SHARED flag

Discussed with: kib
MFC after: 20 days


242572 04-Nov-2012 avg

opensolaris compat: clear VI_MOUNT before returning if mount_snapshot fails

To do: investigate if it would be possible to use normal vfs_domount here.

Reviewed by: kib
MFC after: 19 days


242571 04-Nov-2012 avg

zfs_vnode_forget: dispose of larvae vnode using public vfs api (mostly)

Reviewed by: kib
MFC after: 19 days


242570 04-Nov-2012 avg

zfs_umount: no need to set MNTK_UNMOUNTF here, dounmount handles that

Reviewed by: kib
MFC after: 19 days


242569 04-Nov-2012 avg

opensolaris_lookup: use vfs_busy in traverse before calling VFS_ROOT

... to ensure that we have a valid mountpoint during the call.

Reviewed by: kib
MFC after: 19 days


242568 04-Nov-2012 avg

zfs_vnode_lock: no need to double-guess caller's intentions here

vn_lock should do the right thing with respect to given vnode lock
flags. If a caller doesn't mind a doomed vnode, then zfs should deliver.

Reviewed by: kib
MFC after: 19 days


242567 04-Nov-2012 avg

zfs_mount: drop vfs.zfs.rootpool.prefer_cached_config tunable

It turned out to be not that useful, because its default value may lead
to a problem when a root pool is present in zpool.cache, but its
on-disk status is 'exported'. This may happen if the pool was imported
in a different environment with -f flag and then exported.

MFC after: 12 days


242566 04-Nov-2012 avg

zfs_freebsd_close: call zfs_close with count=1 instead of count=0

Otherwise we may be leaking z_sync_cnt, which may lead to unnecessary
ZIL sync-ing.

MFC after: 12 days


242332 30-Oct-2012 delphij

s/dettach/detach/g

Approved by: pjd
MFC after: 1 month


242135 26-Oct-2012 avg

zfs: fix label validation code in vdev_geom_read_config

POOL_STATE_SPARE and POOL_STATE_L2CACHE were not handled correctly
and thus the cache and spare disks would not be correctly probed.

Reported by: Michael Schmiedgen <schmiedgen@gmx.net>,
Matthew D. Fuller <fullermd@over-yonder.net>
Tested by: Michael Schmiedgen <schmiedgen@gmx.net>,
flo
MFC after: 5 days


241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


241773 20-Oct-2012 avg

zfs: wait in arc_lowmem only if curproc == pageproc

... otherwise the current thread might be holding ARC locks and thus run
into a deadlock. This happens, for example, when a thread does memory
allocation in the ARC code and runs into KVA shortage.
Also, it really makes the most sense to wait in pageproc, so that the
results of ARC reclamation are seen before the page cache is acted upon.
In other cases where vm_lowmem is invoked, e.g. on KVA space shortage,
the callers perform multiple attempts (up to 8) and wait for rather
long intervals between them (up to 4 seconds), so ARC reclaim results
should become visible even without explicit waiting on the ARC thread.

Note that this is not a critical issue for typical ZFS usages where KVA
space should already be large enough. On amd64 systems setting KVA size
to twice the physical memory size is known to mitigate KVA fragmentation
issues in practice.

Side note: perhaps vm_lowmem 'how' parameter should be used to
differentiate between causes of the event.

Reported by: Nikolay Denev <ndenev@gmail.com>
MFC after: 19 days


241628 17-Oct-2012 avg

zfs: make use of getnewvnode_reserve in zfs_mknode and zfs_zget

getnewvnode_reserve helps to avoid "recursing" back into zfs code
via getnewvnode when that latter needs to reclaim some vnodes.
zfs code may hold a number of locks around getnewvnode and doesn't
expect any recursion to happen on those locks, because that never
happens in solaris.

I believe that this change also eleiminates a need for the delayed
znode destruction via the taskqueue.

Many thanks to kib for devising getnewvnode_reserve.

Reported by: flo
Tested by: bapt, kwm, swills
MFC after: 2 weeks
X-MFC after: r241556


241394 10-Oct-2012 kevlo

Revert previous commit...

Pointyhat to: kevlo (myself)


241370 09-Oct-2012 kevlo

Prefer NULL over 0 for pointers


241297 06-Oct-2012 avg

zvol: set mediasize in geom provider right upon its creation

... instead of deferring the action until first open.
Unlike upstream this has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened. Initial mediasize of zero causes tasting failure
and subsequent retasting because of the size change.

MFC after: 14 days


241286 06-Oct-2012 avg

zfs_mount: taste geom providers for root pool config

This should allow to mount a dataset as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.

If the root filesystem's pool is found in zpool.cache, the by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.

Discussed with: gibbs, pjd, des
MFC after: 25 days


240955 26-Sep-2012 mm

Merge recent vendor changes in ZFS.

Illumos issued covered:
2811 missing implementation: zfs send -r
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting

References:
https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


240870 23-Sep-2012 pjd

It is possible to recursively destroy snapshots even if the snapshot
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:

tank
tank/foo
tank/foo@snap
tank/bar
tank/bar@snap

We can execute:

# zfs destroy -t tank@snap

eventhough tank@snap doesn't exit.

Unfortunately it is not possible to do the same with recursive rename:

# zfs rename -r tank@snap tank@pans
cannot open 'tank@snap': dataset does not exist

...until now. This change allows to recursively rename snapshots even if
snapshot doesn't exist on the starting dataset.

Sponsored by: rsync.net
MFC after: 2 weeks


240868 23-Sep-2012 pjd

Add TRIM support.

The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something will be reordered and write will reached the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by: multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from: zfsonlinux
Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>


240831 22-Sep-2012 avg

zfs: allow a zvol to be used as a pool vdev, again

Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.

Reviewed by: pjd
MFC after: 24 days


240829 22-Sep-2012 pjd

As in r226967, r226987 and r232401 changes to UFS and TMPFS remove cache
entries associated with the source and the target of rename().

MFC after: 1 week


240632 18-Sep-2012 avg

zfs: correctly calculate dn_bonuslen for saving SAs to disk

Since all attribute values start at 8-byte aligned boundary, we would
previously incorrectly calculate dn_bonuslen if any attribute but the
last had a variable-length value with length not multiple of 8.

Reported by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Tested by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> (for upstream)
MFC after: 2 weeks


240631 18-Sep-2012 avg

zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line

Discussed with: pjd
MFC after: 10 days


240415 12-Sep-2012 mm

Merge recent zfs vendor changes, sync code and adjust userland DEBUG.

Illumos issued covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

References:
https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


240356 11-Sep-2012 avg

forgotten file from r240346

Pointyhat to: avg
MFC after: 10 days
X-MFC with: r240346


240345 11-Sep-2012 avg

zfs: fix sa_modify_attrs handling of variable-sized attributes

- skip length_idx index for a replaced variable-sized attribute
- skip length_idx index for a removed variable-sized attribute
- also re-arranged code to make sure that length_idx is always
incremented for variable-sized attributes
- additionally add an assertion that the number of actually produced
attributes is the same as the expected number of resulting
attributes

In cooperation with: Matthew Ahrens <mahrens@delphix.com>
Tested by: Trent Nelson <trent@snakebite.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> (for upstream)
To do: get this upstreamed
MFC after: 2 weeks


240303 10-Sep-2012 mm

Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.

MFC after: 2 weeks


240162 06-Sep-2012 mm

Make r230454 more readable and vendor-like.

PR: kern/171380
MFC after: 3 days


240133 05-Sep-2012 mm

Merge recent vendor changes and sync code:
1862 incremental zfs receive fails for sparse file > 8PB
3112 ztest does not honor ZFS_DEBUG
3122 zfs destroy filesystem should prefetch blocks
3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

References:
https://www.illumos.org/issues/1862
https://www.illumos.org/issues/3112
https://www.illumos.org/issues/3122
https://www.illumos.org/issues/3129
https://www.illumos.org/issues/3130

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239786 28-Aug-2012 ed

Use a proper destructor function.

When calling a revoke(2) on a dtrace device, dtrace_close() could be
called, even if threads are still stuck in the device. Defer the actual
deallocation of datastructures to the cdevpriv destructor.

While there, remove the unneeded D_TRACKCLOSE and D_NEEDMINOR flags. For
the helper device, we never need it. For the regular dtrace devices, we
only need these flags on FreeBSD pre-8.

MFC after: 1 month


239774 28-Aug-2012 mm

Merge recent vendor changes:
3100 zvol rename fails with EBUSY when dirty
3104 eliminate empty bpobjs
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

References:
https://www.illumos.org/issues/3100
https://www.illumos.org/issues/3104
https://www.illumos.org/issues/3120

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239620 23-Aug-2012 mm

Merge recent vendor changes:
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Referenes:
https://www.illumos.org/issues/3086
https://www.illumos.org/issues/3090
https://www.illumos.org/issues/3102

PR: kern/170912, kern/170914
Obtained from: illumos (changeset #13776, #13777)
MFC after: 2 weeks


239389 19-Aug-2012 mm

Backport fix for vendor issue #3085
3085 zfs diff panics, then panics in a loop on booting

References:
https://www.illumos.org/issues/3085

PR: kern/170763
Obtained from: ssh://anonhg@hg.illumos.org/illumos-gate (r13772)
MFC after: 1 week


239303 15-Aug-2012 hselasky

Streamline use of cdevpriv and correct some corner cases.

1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, hence for example read, write, ioctl and
so on might be sleeping at the time of "d_close" being called
and then then freed private data can still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)

2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.

3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.

Discussed with: phk
MFC after: 2 weeks


239077 05-Aug-2012 marius

Include <vm/vm_param.h> for PA_LOCK_COUNT in order to fix kernel build
with options ZFS after r239065.


238926 30-Jul-2012 mm

Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags

References:
https://www.illumos.org/issues/2762

MFC after: 2 weeks


238656 20-Jul-2012 trasz

Make ZVOL resizing ('zfs set volsize') properly resize the GEOM provider.

Sponsored by: FreeBSD Foundation


238552 17-Jul-2012 gnn

Change UL to ULL since time is 32 bits.

Pointed out by: avg@
MFC after: 2 weeks


238537 16-Jul-2012 gnn

Add support for walltimestamp in DTrace.

Submitted by: Fabian Keil
MFC after: 2 weeks


238169 06-Jul-2012 avg

r237748 continuation: fix nopw (0f 1f) behavior with respect to modifiers

To do: proper merge with Illumos vendor area.

Reported by: emaste
Tested by: emaste
Obtained from: Illumos commit 13442:4adbe6de60c8
MFC after: 5 days


238168 06-Jul-2012 avg

r237748 continuation: segment-override prefixes are not invalid in long mode

Update DTrace disassembler accordingly. The code to treat the prefixes
as null prefixes was already in place.
Although in practice compilers seem to generate only cs-prefix for use
in long NOPs, the same treatment is applied to all of cs, ds, es, ss for
consistency.

Reported by: emaste
Tested by: emaste
Obtained from: Illumos commit 13442:4adbe6de60c8 (+ local changes)
MFC after: 5 days


238113 04-Jul-2012 pjd

vdev_io_done stage is not used for ioctls.

MFC after: 1 week


237972 02-Jul-2012 mm

Expose scrub and resilver tunables.
This allows the user to tune the priority trade-off between scrub/resilver
and other ZFS I/O.

MFC after: 2 weeks
Discussed with: pjd


237817 29-Jun-2012 pfg

Bump dtrace_helper_actions_max from 32 to 128

Dave Pacheco from Joyent (and Dtrace.org) bumped the cap to 1024 but,
according to his blog, 128 is the recommended minimum.

For now bump it safely to 128 although we may have to bump it further
if there is demand in the future.

Reference:

http://www.illumos.org/issues/2558
http://dtrace.org/blogs/dap/2012/01/50/where-does-your-node-program-spend-its-time/


237748 29-Jun-2012 avg

dtrace instruction decoder: add 0x0f 0x1f NOP opcode support

According to the AMD manual the whole range from 0x09 to 0x1f are NOPs.
Intel manual mentions only 0x1f. Use only Intel one for now, it seems
to be the one actually generated by compilers.
Use gdb mnemonic for the operation: "nopw".

[1] AMD64 Architecture Programmer's Manual
Volume 3: General-Purpose and System Instructions
[2] Software Optimization Guide for AMD Family 10h Processors
[3] Intel(R) 64 and IA-32 Architectures Software Developer’s Manual
Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Tested by: Fabian Keil <freebsd-listen@fabiankeil.de> (earlier version)
MFC after: 3 days


237624 27-Jun-2012 pfg

Bring llquantize support into Dtrace.

Bryan Cantrill implemented the equivalent of semi-log graph
paper for Dtrace so llquantize will use one logarithmic and
one linear scale.

Special thanks to Mark Peek for providing fix to an
assertion and to Fabian Keill for testing the port.

Illumos Revision: 13355:15b74a2a9a9d

Reference:
https://www.illumos/issues/905

Obtained from: Illumos
Tested by: Fabian Keill, mp
MFC after: 4 days


237458 22-Jun-2012 mm

Import Illumos revision 13736:9f1d48e1681f
2901 ZFS receive fails for exabyte sparse files

References:
https://www.illumos.org/issues/2901

Obtained from: illumos (issue #2901)
MFC after: 1 week


236884 11-Jun-2012 mm

Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags

References:
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747

Obtained from: illumos (issue #2619, #2747)
MFC after: 1 month


236823 09-Jun-2012 pjd

ds_guid of 0 is special, as it is used by snapshot receive code to
differentiate between an incremental and full stream.
Be sure not to generate guid equal to 0.

Reported by: someone who saw 0 being generated as 64bit random guid
MFC after: 3 days


236567 04-Jun-2012 gnn

Integrate a fix for a very odd signal delivery problem found
by Bryan Cantril and others in the Solaris/Illumos version of DTrace.

Obtained from: https://www.illumos.org/issues/789
MFC after: 2 weeks


236566 04-Jun-2012 zml

Fix DTrace TSC skew calculation:

The skew calculation here is exactly backwards. We were able to repro
it on a multi-package ESX server running a FreeBSD VM, where the TSCs
can be pretty evil.

MFC after: 1 week

Submitted by: Jeff Ford <jeffrey.ford2@isilon.com>
Reviewed by: avg, gnn


236250 29-May-2012 pjd

Tighten up the assertion: because size can't be 0 and even if sm_space is equal
to sm_size, any 'sm_space - size' will be less than sm_size.

MFC after: 3 days


236249 29-May-2012 pjd

Eliminate 'where' argument, we don't use it.

MFC after: 3 days


236248 29-May-2012 pjd

Remove unused variable.

MFC after: 3 days


236247 29-May-2012 pjd

Remove unused sysctl.

MFC after: 3 days


236155 27-May-2012 mm

Import illumos changeset 13570:3411fd5f1589
1948 zpool list should show more detailed pool information

Display per-vdev information with "zpool list -v".
The added expandsize property has currently no value on FreeBSD.
This changeset allows adding expansion support to individual vdevs
in the future.

References:
https://www.illumos.org/issues/1948

Obtained from: illumos (issue #1948)
MFC after: 2 weeks


236146 27-May-2012 mm

Import illumos changeset 13605:b5c2b5db80d6 (partial)
763 FMD msg URLs should refer to something visible

Replace sun.com URL's with illumos.org

References:
https://www.illumos.org/issues/763

Obtained from: illumos (issue #763)
MFC after: 1 week


235781 22-May-2012 trasz

Fix enforcement of file size limit with O_APPEND on ZFS.

vn_rlimit_fsize takes uio->uio_offset and uio->uio_resid into account
when determining whether given write would exceed RLIMIT_FSIZE.

When APPEND flag is specified, ZFS updates uio->uio_offset to point to the
end of file.

But this happens after a call to vn_rlimit_fsize, so vn_rlimit_fsize check
can be rendered ineffective by thread that opens some file with O_APPEND
and lseeks below RLIMIT_FSIZE before calling write.

Submitted by: Mateusz Guzik <mjguzik at gmail dot com>
MFC after: 2 weeks


235343 12-May-2012 avg

add a zfs spa_t change missed in r235329

sys/cddl/boot is obviously not under sys/boot...

Pointed out by: Jan Beich <jbeich@tormail.org>
Pointyhat to: avg
MFC after: 1 month


235222 10-May-2012 mm

Import illumos changeset 13686:4bc0783f6064
2703 add mechanism to report ZFS send progress

If the zfs send command is used with the -v flag, the amount of bytes
transmitted is reported in per second updates.

References:
https://www.illumos.org/issues/2703

Obtained from: illumos (issue #2703)
MFC after: 2 weeks


234795 29-Apr-2012 marius

Partially revert r232938; ZFS only requires nfs4 but not posix1e.

Submitted by: jhb


234691 26-Apr-2012 rstone

Implement the D "cpu" variable, which returns curcpu. I have chosen not
to follow the example of OpenSolaris and its descendants, which implemented
cpu as an inline that took a value out of curthread. At certain points in
the FreeBSD scheduler curthread->td_oncpu will no longer be valid (in
particukar, just before the thread gets descheduled) so instead I have
implemented this as its own built-in variable.

Sponsored by: Sandvine Inc.
MFC after: 1 week


234607 23-Apr-2012 trasz

Remove unused thread argument to vrecycle().

Reviewed by: kib


234064 09-Apr-2012 attilio

- Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039


233918 05-Apr-2012 avg

zfs_ioctl: no need for ddi_copyin/out here because sys_ioctl handles that

On FreeBSD the direct ioctl argument is automatically copied in/out
as necesary by the kernel ioctl entry point.

PR: kern/164445
Submitted by: Luis Garces-Erice <lge@ieee.org>
Tested by: Attila Nagy <bra@fsn.hu>
MFC after: 5 days


233552 27-Mar-2012 rstone

Instead of only iterating over the set of known SDT probes when sdt.ko is
loaded and unloaded, also have sdt.ko register callbacks with kern_sdt.c
that will be called when a newly loaded KLD module adds more probes or
a module with probes is unloaded.

This fixes two issues: first, if a module with SDT probes was loaded after
sdt.ko was loaded, those new probes would not be available in DTrace.
Second, if a module with SDT probes was unloaded while sdt.ko was loaded,
the kernel would panic the next time DTrace had cause to try and do
anything with the no-longer-existent probes.

This makes it possible to create SDT probes in KLD modules, although there
are still two caveats: first, any SDT probes in a KLD module must be part
of a DTrace provider that is defined in that module. At present DTrace
only destroys probes when the provider is destroyed, so you can still
panic the system if a KLD module creates new probes in a provider from a
different module(including the kernel) and then unload the the first module.

Second, the system will panic if you unload a module containing SDT probes
while there is an active D script that has enabled those probes.

MFC after: 1 month


233525 26-Mar-2012 gonzo

- For o32 ABI get arguments from the stack
- Clear CPU_DTRACE_FAULT flag in userland backtrace routine. It just
means we hit wrong memory region and should stop.


233521 26-Mar-2012 gonzo

Properly cast 64-bit dofhp_dof to pointer.

For i386 this change is no-op. For AMD64 it was tested with DTrace test
suite: results are the same from the test run before the change and after


233484 26-Mar-2012 gonzo

Use macroses to load/store pointers and increase indexes instead of
hardcoded MIPS64 instructions


233409 24-Mar-2012 gonzo

Add device part of DTrace/MIPS code


233408 24-Mar-2012 gonzo

Add MIPS support to cddl/contrib part:

- header and stub .c file for fasttrap module. It's not supported on
MIPS yet, but there is no way to disable support completely
- Do as amd64 trying to limit allocated memory


232938 13-Mar-2012 adrian

Add dependencies onto acl_posix1e and acl_nfs4.


232186 26-Feb-2012 mm

Analogous to r232059, add a parameter for the ZFS file system:

allow.mount.zfs:
allow mounting the zfs filesystem inside a jail

This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers

MFC after: 10 days


231949 21-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


231852 17-Feb-2012 bz

Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:

Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.

This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.

Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days


230945 03-Feb-2012 mm

Revert r230913 and r230914.

The initialization was correct, the problem needs deeper analysis.


230914 02-Feb-2012 mm

Add copyright information on last commits to comply with CDDL.

Discussed with: pluknet@
MFC after: 3 days


230913 02-Feb-2012 mm

Fix out of bounds write causing random panics,
uncovered by the change in r230256

Reviewed by: pluknet@
MFC after: 3 days


230689 29-Jan-2012 kmacy

always exclude data bufs regardless of debug settings


230647 28-Jan-2012 kmacy

add tunable for developers working on areas outside of ZFS
to further reduce core size by excluding ARC metadata buffers
from core dumps


230623 27-Jan-2012 kmacy

exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create

Reviewed by: alc, avg
MFC after: 2 weeks


230514 24-Jan-2012 mm

Merge illumos revisions 13572, 13573, 13574:

Rev. 13572:
disk sync write perf regression when slog is used post oi_148 [1]

Rev. 13573:
crash during reguid causes stale config [2]
allow and unallow missing from zpool history since removal of pyzfs [5]

Rev. 13574:
leaking a vdev when removing an l2cache device [3]
memory leak when adding a file-based l2arc device [4]
leak in ZFS from metaslab_group_create and zfs_ereport_checksum [6]

References:
https://www.illumos.org/issues/1909 [1]
https://www.illumos.org/issues/1949 [2]
https://www.illumos.org/issues/1951 [3]
https://www.illumos.org/issues/1952 [4]
https://www.illumos.org/issues/1953 [5]
https://www.illumos.org/issues/1954 [6]

Obtained from: illumos (issues #1909, #1949, #1951, #1952, #1953, #1954)
MFC after: 2 weeks


230454 22-Jan-2012 pjd

Use provided name when allocating ksid domain. It isn't really used on FreeBSD,
but should fix a panic when pool is imported from another OS that is using this.

MFC after: 1 week


230438 21-Jan-2012 pjd

Dramatically optimize listing snapshots when user requests only snapshot
names and wants to sort them by name, ie. when executes:

# zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot properties.

Below you can find how long does it take to list 34509 snapshots from a single
disk pool before and after this change with cold and warm cache:

before:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s

after:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s

MFC after: 1 week


230397 20-Jan-2012 pjd

By default turn off prefetch when listing snapshots.
In my tests it makes listing snapshots 19% faster with cold cache and
47% faster with warm cache.

MFC after: 1 week


230256 17-Jan-2012 pluknet

Fix the "lock &zrl->zr_mtx already initialized" assertion by initializing
the allocated memory before calling mtx_init(9) on mtx pointing to it.
Otherwize, random contents of uninitialized memory might occasionally
trigger the assertion.

Reported by: Pavel Polyakov <bsd kobyla org>
Reviewed by: pjd
MFC after: 1 week


229663 05-Jan-2012 pjd

- Allow to change vfs.zfs.arc_meta_limit at runtime.
- Change vfs.zfs.arc_meta_used from CTLFLAG_RDTUN to CTLFLAG_RD, as it is
not a tunable.

MFC after: 3 days


229425 03-Jan-2012 dim

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c, check the
the number of links against LINK_MAX (which is INT16_MAX), not against
UINT32_MAX. Otherwise, the constant would implicitly be converted to
-1.

Reviewed by: pjd
MFC after: 1 week


228710 19-Dec-2011 avg

opensolaris compat: fix vcmn_err so that panic(9) produces a proper message

... instead of just a verbatim format string.

Reviewed by: pjd
MFC after: 1 week


228686 18-Dec-2011 pjd

From time to time people report space map corruption resulting in panic
(ss == NULL) on pool import. I had such a panic recently. With current version
of ZFS it is still possible to import the pool in readonly mode and backup
all the data, but in case it is impossible for some reason add tunable
vfs.zfs.space_map_last_hope, which when set to '1' will tell ZFS to remove
colliding range and retry. This seems to have worked for me, but I consider
it highly risky to use.

MFC after: 1 week


228685 18-Dec-2011 pjd

Implement replying of ACLs updates. ACL changes should go to ZIL only
if the 'sync' property is set to 'always', so replying them is not common.

MFC after: 1 month


228448 12-Dec-2011 attilio

Revert the approach for skipping lockstat_probe_func call when doing
lock_success/lock_failure, introduced in r228424, by directly skipping
in dtrace_probe.

This mainly helps in avoiding namespace pollution and thus lockstat.h
dependency by systm.h.

As an added bonus, this also helps in MFC case.
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424


228392 10-Dec-2011 pjd

Move ru_inblock increment into arc_read_nolock() so we don't account for
cached reads.

Discussed with: gibbs
No objections from: avg
Tested by: Marcus Reid <marcus@blazingdot.com>
MFC after: 1 week


228363 09-Dec-2011 pjd

The vfs.zfs.txg.timeout sysctl can be safely modified at run time.

MFC after: 1 week


228104 28-Nov-2011 mm

Fix typo in copyright notice.

MFC after: 1 month


228103 28-Nov-2011 mm

Merge new ZFS features from illumos:

1644 add ZFS "clones" property
https://www.illumos.org/issues/1644

1645 add ZFS "written" and "written@..." properties
https://www.illumos.org/issues/1645

1646 "zfs send" should estimate size of stream
https://www.illumos.org/issues/1646

1647 "zfs destroy" should determine space reclaimed by destroying multiple
snapshots
https://www.illumos.org/issues/1647

1693 persistent 'comment' field for a zpool
https://www.illumos.org/issues/1693

1708 adjust size of zpool history data
https://www.illumos.org/issues/1708

1748 desire support for reguid in zfs
https://www.illumos.org/issues/1748

Obtained from: illumos (changesets 13514, 13524, 13525)
MFC after: 1 month


227697 19-Nov-2011 kib

Existing VOP_VPTOCNP() interface has a fatal flow that is critical for
nullfs. The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.

Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.

Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.

Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)


227441 11-Nov-2011 rstone

Correct the types of the arguments to return probes of the syscall
provider. Previously we were erroneously supplying the argument types of
the corresponding entry probe.

Reviewed by: rpaulo
MFC after: 1 week


227430 10-Nov-2011 rstone

On i386, fbt probes are implemented by writing an invalid opcode over
certain instructions in a function prologue or epilogue. DTrace has a
hook into the invalid opcode fault handler that checks whether the fault
was due to an probe and if so, runs the DTrace magic.

Upon returning from an invalid opcode fault caused by a probe, DTrace must
emulate the instruction that was replaced with the invalid opcode and then
return control to the instruction following the invalid opcode.

There were a pair of related bugs in the emulation for the leave
instruction. The leave instruction is used to pop off a stack frame prior
to returning from a function. The emulation for this instruction must
move the trap frame for the invalid opcode fault down the stack to the
bottom of the stack frame that is being removed, and then execute an iret.

At two points in this process, the emulation code was storing values above
the current value of the stack pointer. This opened up a window in which
if we were two take an interrupt, the trap frame for the interrupt would
overwrite the values stored on the stack, causing the system to panic
later.

The first bug was that at one point the emulation code saves the new value
for $esp above the current stack pointer value. The fix is to save this
value instead inside of the original trap frame. At this point we do
not need the original trap frame so this is safe.

The second bug is that when the emulate code loads $esp from the stack, it
points part-way through the new trap frame instead of at its beginning.
The emulation code adjusts the stack pointer to the correct value
immediately afterwards, but this still leaves a one instruction window in
which an interrupt would corrupt this trap frame. Fix this by adjusting
the stack frame value before loading it into $esp.

This fixes panics in invop_leave on i386 when using fbt return probes.

Reviewed by: rpaulo, attilio
MFC after: 1 week


227293 07-Nov-2011 ed

Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.

This means that their use is restricted to a single C file.


227291 07-Nov-2011 rstone

Replace fasttrap_copyout() with uwrite(). FreeBSD copyout() is not able to
write to the .text section of a process.

Obtained from: rpaulo
MFC after: 3 days


227111 05-Nov-2011 pjd

Correct typo in comment.

Reported by: Fabian Keil <fk@fabiankeil.de>
MFC after: 3 days


227110 05-Nov-2011 pjd

In zvol_open() if the spa_namespace_lock is already held, it means that
ZFS is trying to open and taste ZVOL as its VDEV. This is not supported,
so return an error instead of panicing on spa_namespace_lock recursion.

Reported by: Robert Millan <rmh@debian.org>
PR: kern/162008
MFC after: 3 days


226732 25-Oct-2011 mm

Fix typo in copyright notice introduced in r226724
(missing character in e-mail adress)

Reported by: pjd
MFC after: 3 days


226724 25-Oct-2011 mm

Update copyright information in several ZFS files, as the clause 3.3
of the CDDL licence explicitly requires every Contributor to add
a copyright notice.

This also reflects the copyright notices for the changes recently
added by Illumos.

MFC after: 3 days


226707 24-Oct-2011 pjd

- Use better naming now that we allow to rename any mounted file system (not
only legacy).
- Update copyright to include myself.

MFC after: 2 weeks


226700 24-Oct-2011 pjd

Don't forget to rename mounted snapshots of the file system being renamed.

MFC after: 2 weeks


226678 24-Oct-2011 pjd

Include <sys/zfs_vfsops.h> only when compiling kernel module.

MFC after: 2 weeks


226676 24-Oct-2011 pjd

Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' preperty set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.

This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.

MFC after: 2 weeks


226620 21-Oct-2011 pjd

Update per-thread I/O statistics collection in ZFS.
This allows to see processes I/O activity in 'top -m io' output.

PR kern/156218
Reported by: Marcus Reid <marcus@blazingdot.com>
Patch by: avg
MFC after: 3 days


226617 21-Oct-2011 pjd

zfs vdev_file_io_start: validate vdev before using vdev_tsd

vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

Submitted by: avg
MFC after: 3 days


226568 20-Oct-2011 pjd

- Correctly read gang header from raidz.
- Decompress assembled gang block data if compressed.
- Verify checksum of a gang header.
- Verify checksum of assembled gang block data.
- Verify checksum of uber block.

Submitted by: avg
MFC after: 3 days


226553 19-Oct-2011 pjd

Always pass data size for checksum verification function, as using
physical block size declared in bp may not always be what we want.
For example in case of gang block header physical block size declared
in bp is much larger than SPA_GANGBLOCKSIZE (512 bytes) and checksum
calculation failed. This bug could lead to accessing unallocated
memory and resets/failures during boot.

MFC after: 3 days


226550 19-Oct-2011 pjd

Initialize 'rc' properly before using it. This error could lead to infinite
loop when data reconstruction was needed.

MFC after: 3 days


226549 19-Oct-2011 pjd

Remove redundant size calculation.

MFC after: 3 days


226512 18-Oct-2011 mm

Import fix for Illumos bug #1475 to reduce diff against upstream.

Panic caused by this bug was already partially fixed by pjd@
in p4 CH 185940 and 185942.

Reference:
1475 zfs spill block hold can access invalid spill blkptr
https://www.illumos.org/issues/1475

Reviewed by: delphij
Obtained from: Illumos (issue 1475, changeset 13469:b8e89e5c4167)
MFC after: 1 week


226483 17-Oct-2011 delphij

Fix a bug in sa_find_sizes() which could lead to panic:

When calculating space needed for SA_BONUS buffers,
hdrsize is always rounded up to next 8-aligned boundary.
However, in two places the round up was done against
sum of 'total' plus hdrsize. On the other hand,
hdrsize increments by 4 each time, which means in
certain conditions, we would end up returning with
will_spill == 0 and (total + hdrsize) larger than
full_space, leading to a failed assertion because
it's invalid for dmu_set_bonus.

Sponsored by: iXsystems, Inc.
Reviewed by: mm
MFC after: 3 days


226452 16-Oct-2011 marcel

Define dtrace_cmpset_long in terms of atomic_cmpset_long
and not by virtue of inline assembly. Now this file
compiles on all supported architectures.


225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


225531 13-Sep-2011 avg

zfs boot subroutines: correctly specify type of an integer literal

Found by adding more warning flags to zfs boot blocks build.

Approved by: re (kib)
MFC after: 1 week


225418 06-Sep-2011 kib

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


225166 25-Aug-2011 mm

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


225153 24-Aug-2011 pjd

We need to unlock and destroy vnode attached to znode which we are freeing.

Reviewed by: kib
Approved by: re (bz)
MFC after: 1 week


224855 13-Aug-2011 mm

zfs_ioctl.c: improve code readability in zfs_ioc_dataset_list_next()

zvol.c: fix calling of dmu_objset_prefetch() in zvol_create_minors()
by passing full instead of relative dataset name and prefetching all
visible datasets to be processed later instead of just the pool name

Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M opensolaris/uts/common/fs/zfs/zfs_ioctl.c
M opensolaris/uts/common/fs/zfs/zvol.c


224814 13-Aug-2011 mm

Fix race between dmu_objset_prefetch() invoked from
zfs_ioc_dataset_list_next() and dsl_dir_destroy_check() indirectly
invoked from dmu_recv_existing_end() via dsl_dataset_destroy() by not
prefetching temporary clones, as these count as always inconsistent.
In addition, do not prefetch hidden datasets at all as we are not
going to process these later.

Filed as Illumos Bug #1346

PR: kern/157728
Tested by: Borja Marcos <borjam@sarenet.es>, mm
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week


224791 12-Aug-2011 pjd

Eliminate the zfsdev_state_lock entirely and replace it with the
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.

Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week


224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


224605 02-Aug-2011 mm

Fix panic in zfs_read() if IO_SYNC flag supplied by checking for
zfsvfs->z_log before calling zil_commit(). [1]
Do not call zfs_read() from zfs_getextattr() with the IO_SYNC flag.

Submitted by: Alexander Zagrebin <alex@zagrebin.ru> [1]
Reviewed by: pjd@
Approved by: re (kib)
MFC after: 3 days


224579 01-Aug-2011 mm

Fix integer overflow in txg_delay() by initializing
the variable "timeout" as clock_t.

Filed as Illumos Bug #1313

Reviewed by: avg
Approved by: re (kib)
MFC after: 3 days


224526 30-Jul-2011 mm

Fix serious bug in ZIL that can lead to pool corruption
in the case of a held dataset during remount.

Detailed description is available at:
https://www.illumos.org/issues/883

illumos-gate revision: 13380:161b964a0e10

Reviewed by: pjd
Approved by: re (kib)
Obtained from: Illumos (Bug #883)
MFC after: 3 days


224252 21-Jul-2011 delphij

Bring the code more in-line with OpenSolaris source to
ease future port.

Reviewed by: pjd, mm
Approved by: re (kib)


224251 21-Jul-2011 delphij

A different implementation of r224231 proposed by pjd@,
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().

Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)


224231 20-Jul-2011 delphij

Add a new field to in-core znode, z_rdev, to represent device nodes.

PR: kern/159010
Reviewed by: mm@
Approved by: re (kib)
MFC after: 2 weeks


224177 18-Jul-2011 mm

ZFS tries to allocate blocks evenly across all devices. This means when
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.

New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)

Illumos-gate changeset: 13379:4df42cc92254

Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks


224174 18-Jul-2011 mm

Resurrect the ZFS "aclmode" property
Change default of "aclmode" to "discard".

Illumos-gate changeset: 13370:8c04143bd318

Obtained from: Illumos (Feature #742)
MFC after: 2 weeks


223758 04-Jul-2011 attilio

With retirement of cpumask_t and usage of cpuset_t for representing a
mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient.

Remove them and replace their usage with custom pc_cpuid magic (as,
atm, pc_cpumask can be easilly represented by (1 << pc_cpuid) and
pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))).

This change is not targeted for MFC because of struct pcpu members
removal and dependency by cpumask_t retirement.

MD review by: marcel, marius, alc
Tested by: pluknet
MD testing by: marcel, marius, gonzo, andreast


223623 28-Jun-2011 mm

Add a new "REFCOMPRESSRATIO" property.

For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).

This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).

Illumos-gate revision: 13387

Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks


223622 28-Jun-2011 mm

Disable vdev cache (readahead) by default.

The vdev cache is very underutilized (hit ratio 30%-70%) and may consume
excessive memory on systems with many vdevs.

Illumos-gate revision: 13346

Obtained from: Illumos (Bug #175)
MFC after: 1 week


223262 18-Jun-2011 benl

Fix clang warnings.

Approved by: philip (mentor)


222950 10-Jun-2011 gibbs

Remove C constructs that are incompatible with C++ from various
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.

Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.

cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.

sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.

Sponsored by: Spectra Logic Corporation


222835 07-Jun-2011 mm

Silence notice on pool creation, import and access.

Suggested by: Jeremy Chadwick (freebsd-stable@)
Discussed with: pjd
MFC after: 1 week


222813 07-Jun-2011 attilio

etire the cpumask_t type and replace it with cpuset_t usage.

This is intended to fix the bug where cpu mask objects are
capped to 32. MAXCPU, then, can now arbitrarely bumped to whatever
value. Anyway, as long as several structures in the kernel are
statically allocated and sized as MAXCPU, it is suggested to keep it
as low as possible for the time being.

Technical notes on this commit itself:
- More functions to handle with cpuset_t objects are introduced.
The most notable are cpusetobj_ffs() (which calculates a ffs(3)
for a cpuset_t object), cpusetobj_strprint() (which prepares a string
representing a cpuset_t object) and cpusetobj_strscan() (which
creates a valid cpuset_t starting from a string representation).
- pc_cpumask and pc_other_cpus are target to be removed soon.
With the moving from cpumask_t to cpuset_t they are now inefficient
and not really useful. Anyway, for the time being, please note that
access to pcpu datas is protected by sched_pin() in order to avoid
migrating the CPU while reading more than one (possible) word
- Please note that size of cpuset_t objects may differ between kernel
and userland. While this is not directly related to the patch itself,
it is good to understand that concept and possibly use the patch
as a reference on how to deal with cpuset_t objects in userland, when
accessing kernland members.
- KTR_CPUMASK is changed and now is represented through a string, to be
set as the example reported in NOTES.

Please additively note that no MAXCPU is bumped in this patch, but
private testing has been done until to MAXCPU=128 on a real 8x8x2(htt)
machine (amd64).

Please note that the FreeBSD version is not yet bumped because of
the upcoming pcpu changes. However, note that this patch is not
targeted for MFC.

People to thank for the time spent on this patch:
- sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested
several revision of the patches and really helped in improving
stability of this work.
- marius fixed several bugs in the sparc64 implementation and reviewed
patches related to ktr.
- jeff and jhb discussed the basic approach followed.
- kib and marcel made targeted review on some specific part of the
patch.
- marius, art, nwhitehorn and andreast reviewed MD specific part of
the patch.
- marius, andreast, gonzo, nwhitehorn and jceel tested MD specific
implementations of the patch.
- Other people have made contributions on other patches that have been
already committed and have been listed separately.

Companies that should be mentioned for having participated at several
degrees:
- Yahoo! for having offered the machines used for testing on big
count of CPUs.
- The FreeBSD Foundation for having sponsored my devsummit attendance,
which has been instrumental.
- Sandvine for having offered offices and infrastructure during
development.

(I really hope I didn't forget anyone, if it happened I apologize in
advance).


222757 06-Jun-2011 mm

Remove empty #ifndef

MFC after: 3 days


222670 04-Jun-2011 avg

opensolaris compat / zfs: avoid early overflow in ddi_get_lbolt*

Reported by: David P. Discher <dpd@bitgravity.com>
Tested by: will
Reviewed by: art
Discussed with: dwhite
MFC after: 2 weeks


222518 31-May-2011 pjd

Imagine situation where a security problem is found in setuid binary.
User upgrades his system to fix the problem, but if he has any ZFS snapshots
for the file system which contains problematic binary, any user can mount the
snapshot and execute vulnerable binary.

Prevent this from happening by always mounting snapshots with setuid turned off.

MFC after: 2 weeks


222343 27-May-2011 pjd

Silence warnings about unsupoorted value types.

MFC after: 2 weeks


222268 24-May-2011 pjd

Don't pass pointer to name buffer which is on the stack to another thread,
because the stack might be paged out once the other thread tries to use the
data. Instead, just allocate memory.

MFC after: 2 weeks


222267 24-May-2011 pjd

Don't access task structure once we call task function.
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.

Submitted by: anonymous
MFC after: 2 weeks


222199 22-May-2011 rmacklem

Fix the zfs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by: pjd


222167 22-May-2011 rmacklem

Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by: kib


222050 18-May-2011 mm

Restore old (v15) behaviour for a recursive snapshot destroy.
(zfs destroy -r pool/dataset@snapshot)

To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.

Filed as Illumos Bug #1043

Reviewed by: delphij
Approved by: pjd
MFC after: together with v28


221991 16-May-2011 avg

Revert accidentally committed local change in r221990

Pointyhat to: avg


221990 16-May-2011 avg

better integrate cyclic module with clocksource/eventtimer subsystem

Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times. As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.

Reviewed by: mav (earlier version, a while ago)
X-MFC after: clocksource/eventtimer subsystem


221740 10-May-2011 avg

dtrace: remove unused code

Which is also useless, IMO.

MFC after: 5 days


221409 03-May-2011 marius

Convert the last use of xcopyout() to ddi_copyout() and remove the now
unused xcopyin() as well as xcopyout().
MFC together with r219089.

Approved by: mm


221263 30-Apr-2011 mm

Fix deduplicated zfs receive
(dmu_recv_stream builds incomplete guid_to_ds_map)

Illumos-gate changeset: 13329:c48b8bf84ab7
MFC together with v28

Approved by: pjd
Obtained from: Illumos (Bug #755)


221112 27-Apr-2011 marcel

Fix copy-paste bug.


220447 08-Apr-2011 mm

Partially fix ZFS compat code for sparc64.
Some endianess bugs still need to be resolved.

Submitted by: marius (parts of the fix)
MFC after: 1 month


220437 08-Apr-2011 art

Stripped '32' suffix from linux systrace module name on i386.

Approved by: avg


220433 07-Apr-2011 jkim

Use atomic load & store for TSC frequency. It may be overkill for amd64 but
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).


219973 24-Mar-2011 pjd

Checking file access on size change is bogus. The checks are done earlier by
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.

PR: standards/154873
Reported by: Mark Martinec <Mark.Martinec@ijs.si>
Discussed with: Eric Schrock <eric.schrock@delphix.com>
Discussed with: Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after: 1 month


219636 14-Mar-2011 pjd

Fix potential panic in dbuf_sync_list() relate to spill blocks handling.

Obtained from: IllumOS
MFC after: 1 month


219561 12-Mar-2011 avg

add DTrace systrace support for linux32 and freebsd32 on amd64 syscalls

Add systrace_linux32 and systrace_freebsd32 modules which provide
support for tracing compat system calls in addition to native system
call tracing provided by systrace module.

Provided that all the systrace modules are loaded now you can select
what syscalls to trace in the following manner:

syscall::xxx:yyy - work on all system calls that match the specification
syscall:freebsd:xxx:yyy - only native system calls
syscall:linux32:xxx:yyy - linux32 compat system calls
syscall:freebsd32:xxx:yyy - freebsd32 compat system calls on amd64

PR: kern/152822
Submitted by: Artem Belevich <fbsdlist@src.cx>
Reviewed by: jhb (earlier version)
MFC after: 3 weeks


219404 08-Mar-2011 pjd

Correct readdir over ZFS handling.

Reported by: Pierre Beyssac <pb@fasterix.frmug.org>
MFC after: 1 month


219320 06-Mar-2011 pjd

Fix libzpool build.

MFC after: 1 month


219317 05-Mar-2011 pjd

Make renaming of a ZVOL, ZVOL's parent directory and ZVOL snapshot work.

Reported by: avg
MFC after: 1 month


219316 05-Mar-2011 pjd

Simplify zvol_remove_minors() a bit.

MFC after: 1 month


219092 28-Feb-2011 pjd

Use proper lock in assertion.

MFC after: 1 month


219089 27-Feb-2011 pjd

Finally... Import the latest open-source ZFS version - (SPA) 28.

Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.

MFC after: 1 month


218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


218665 13-Feb-2011 marcel

Use the preload_fetch_addr() and preload_fetch_size() convenience
functions to obtain the address and size of the preloaded pool
configuration file/repository.

Sponsored by: Juniper Networks.


218550 11-Feb-2011 kib

For UIO_NOCOPY case of reading request on zfs vnode, which has vm object
attached, activate the page after the successful read, and free the page
if read was unsuccessfull.

Freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappeared until the vnode
reclaimed.

Reviewed by: avg
MFC after: 1 week


218386 06-Feb-2011 trasz

Make it impossible to clear the MNT_NFS4ACLS flag on ZFS filesystem
by using "mount -uw".

Reviewed by: pjd
MFC after: 2 weeks


218278 04-Feb-2011 ae

vdev's sectorsize should not be greater than 8 Kbytes and also
it should be power of 2. This prevents non-aligned access while
probing vdev's labels.

PR: kern/147852
Reviewed by: pjd
MFC after: 1 week


218180 01-Feb-2011 mm

Recommit r218169, enclosing with #ifdef _KERNEL
This change is sufficient for the ZFS kernel module.

Discussed with: pjd
MFC after: 1 week


218177 01-Feb-2011 kan

Revert r218169 until it can be tested and fixed properly.


218169 01-Feb-2011 mm

For ZFS, change the type of clock_t to int64_t.

The clock_t type in OpenSolaris is long (int64_t on amd64).
On FreeBSD clock_t is int32_t. The clock_t type is used in several places
in the ZFS code to store system uptime in milliseconds ("seconds * hz").

With hz=1000 we have a 32-bit integer overflow in 24 days, 20 hours,
31 minutes and 23.648 seconds. This has a user reported negative impact
on l2arc_feed_thread() and may cause unexpected results from other functions
using clock_t.

Reported by: Artem Belevich <fbsdlist@src.cx> on freebsd-fs@
MFC after: 1 week


218007 28-Jan-2011 jchandra

CDDL fixes for MIPS n32.

Provide 64 bit atomic ops, and use 32 bit pointer.


217616 19-Jan-2011 mdf

Introduce signed and unsigned version of CTLTYPE_QUAD, renaming
existing uses. Rename sysctl_handle_quad() to sysctl_handle_64().


217588 19-Jan-2011 trasz

Add MNT_NFS4ACLS to ZFS mount flags. It's not conditional, since there
is no way to disable NFSv4 ACLs in ZFS. This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.

Reviewed by: pjd
MFC after: 2 weeks


217367 13-Jan-2011 mdf

Re-commit the zfs sysctl(9) type-safety changes.

Thanks to dim and pjd for the pointer to zfs_context.h for building
userland.


217332 12-Jan-2011 mdf

Revert cddl changes for sysctl(9) until I understand why this isn't
building on universe.


217319 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the zfs piece.


216919 03-Jan-2011 mm

MFp4 r186485, r186859:

Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.

Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days


216505 17-Dec-2010 avg

cyclic xcall: use smp_no_rendevous_barrier as setup function parameter

In this case we call target function only on a single CPU and do not
need any synchronization at the setup stage.

It's a bit non-obvious but setup function of NULL means that
smp_rendezvous_cpus waits for all CPUs to arrive at the rendezvous
point, but without doing any actual setup. While using
smp_no_rendevous_barrier means that each CPU proceeds on its own
schedule without any synchronization whatsoever.

MFC after: 3 weeks


216378 11-Dec-2010 pjd

Remove redundant semicolon and empty like.


216256 07-Dec-2010 ivoras

Undo r216230: the interaction between saved ashift in metadata and
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).

Noticed by: pjd


216254 07-Dec-2010 avg

opensolaris cyclic: fix deadlock and make a little bit closer to upstream

The dealock was caused in the following way:
- thread T1 on CPU C1 holds a spin mutex, IPIs CPU C2 and waits for the
IPI to be handled
- C2 executes timer interrupt filter, thus has interrupts disabled, and
gets blocked on the spin mutex held by T1
The problem seems to have been introduced by simplifications made to
OpenSolaris code during porting.
The problem is fixed by reorganizing the code to more closely resemble
the upstream version. Interrupt filter (cyclic_fire) now doesn't
acquire any locks, all per-CPU data accesses are performed on a
target CPU with preemption and interrupts disabled thus precluding
concurrent access to the data.
cyp_mtx spin mutex is used to disable preemtion and interrupts; it's not
used for classical mutual exclusion, because xcall already serializes
calls to a CPU. It's an emulation of OpenSolaris
cyb_set_level(CY_HIGH_LEVEL) call, the spin mutexes could probably be
reduced to just a spinlock_enter()/_exit() pair.

Diff with upstream version is now reduced by ~500 lines, however it still
remains quite large - many things that are not needed (at the moment) or
are irrelevant on FreeBSD were simply ripped out during porting.
Examples of such things:
- support for CPU onlining/offlining
- support for suspend/resume
- support for running callouts at soft interrupt levels
- support for callout rebinding from CPU to CPU
- support for CPU partitions

Tested by: Artem Belevich <fbsdlist@src.cx>
MFC after: 3 weeks
X-MFC with: r216252


216252 07-Dec-2010 avg

opensolaris cyclic xcall: no need for special handling of curcpu

smp_rendezvous_cpus already properly handles current CPU case
and non-SMP case.

MFC after: 3 weeks


216251 07-Dec-2010 avg

dtrace_xcall: no need for special handling of curcpu

smp_rendezvous_cpus alreadt does the right thing in a very similar
fashion, so the code was kind of duplicating that.

MFC after: 3 weeks


216250 07-Dec-2010 avg

dtrace_gethrtime_init: pin to master while examining other CPUs

Also use pc_cpumask to be future-friendly.

Reviewed by: jhb
MFC after: 2 weeks


216230 06-Dec-2010 ivoras

Use GEOM stripesize field when calculating ashift. This will enable correct
alignment on drives with large sector sizes (e.g. 4 KiB) but the
implementation might need to be revisited if devices with large stripesizes
appear (e.g. if RAID controllers or flash drives start using the field),
probably by introducing a physsectorsize field in GEOM providers.

Discussed with: mav, mostly silence on freebsd-geom@ and freebsd-fs@


216084 30-Nov-2010 trasz

Don't panic when we read an empty ACL from ZFS. Apparently this may happen
with filesystems created under MacOS X ZFS port. This is kind of filesystem
corruption (we don't allow for setting empty ACLs), so make acl_get_file(3)
and related syscalls fail with EINVAL in that case. In theory, we could
return empty ACL to userland, but I'm afraid this would break some code.

MFC after: 3 days


215681 22-Nov-2010 jhb

Remove some bogus, self-referential mergeinfo.


215401 16-Nov-2010 avg

zfs+sendfile: populate all requested pages, not just those already cached

kern_sendfile() uses vm_rdwr() to read-ahead blocks of data to populate
page cache. When sendfile stumbles upon a page that is not populated
yet, it sends out all the mbufs that it collected so far. This
resulted in very poor performance with ZFS when file data is not in the
page cache, because ZFS vop_read for UIO_NOCOPY case populated only
those pages that are already in cache, but not valid. Which means that
most of the time it populated only the first requested page in the
described above scenario.

Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: Alexander Zagrebin <alexz@visp.ru>,
Artemiev Igor <ai@kliksys.ru>
MFC after: 12 days


215397 16-Nov-2010 avg

fix misspelling in a comment

Reported by: Daniel Braniss <danny@cs.huji.ac.il>
MFC after: 3 days


215260 13-Nov-2010 mm

Disable VFS_HOLD placed on mnt_vnodecovered during the mount of a snapshot
and VFS_RELE on a non-existing hold on snapshot parent's z_vfs.

This disables the changes from OpenSolaris onnv-revision 9234:bffdc4fc05c4
(bug IDs: 6792139, 6794830) - not applicable to FreeBSD.

This fixes the process hang if umounting a manually mounted snapshot.

Reported by: Alexander Zagrebin <alexz@visp.ru>
Approved by: delphij (mentor)
MFC after: 1 week


214854 05-Nov-2010 delphij

Validate whether the zfs_cmd_t submitted from userland is not smaller than
what we have. Without the check the kernel could accessing memory that
does not belong to the request struct.

Note that we do not test if the struct equals in size at this time, which
may faciliate forward compatibility with newer binaries.

Reviewed by: pjd at MeetBSD CA '2010
MFC after: 1 week


214378 26-Oct-2010 mm

Bugfix merge from OpenSolaris:

OpenSolaris onnv-revision: 10209:91f47f0e7728
6830541 zfs_get_data_trips on a verify
6696242 multiple zfs_fillpage() zfs: accessing past end of object panics
6785914 zfs fails to drop dn_struct_rwlock in recovery code path

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after: 2 weeks


213937 16-Oct-2010 avg

zfs: add vop_getpages method implementation

This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.

To do: try to satisfy other pages besides the required one taking into
account tradeofs between number of page faults, read throughput and read
latency. Also, eventually vop_putpages should be added too.

Reviewed by: kib, mm, pjd
MFC after: 3 weeks


213791 13-Oct-2010 rpaulo

Pass a format string to panic() and to taskqueue_start_threads().

Found with: clang


213790 13-Oct-2010 rpaulo

In zfs_post_common(), use %d instead of %hhu.

Found with: clang


213730 12-Oct-2010 avg

zfs + sendfile: do not produce partially valid pages for vnode's tail

Since r212650 and before this change sendfile(2) could produce
a partially valid page for a trailing portion of a ZFS vnode.
vm_fault() always wants to see a fully valid page even if it's the last
page that partially extends beyond vnode's end. Otherwise it calls
vop_getpages() to bring in the page. In the case of ZFS this means
that the data is read from the page into the same page and this breaks
checks in ZFS mappedread() - a thread that set VPO_BUSY on the page in
vm_fault() will get blocked forever waiting for it to be cleared.

Many thanks to Kai and Jeremy for reproducing the issue and providing
important debugging information and help.

Reported by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Tested by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Reviewed by: kib
MFC after: 3 days
To-Do: apply the same treatment to tmpfs + sendfile


213673 10-Oct-2010 pjd

Provide internal ioflags() function that converts ioflag provided by FreeBSD's
VFS to OpenSolaris-specific ioflag expected by ZFS. Use it for read and write
operations.

Reviewed by: mm
MFC after: 1 week


213634 08-Oct-2010 mm

Change FAPPEND to IO_APPEND as this is a ioflag and not a fflag.
This corrects writing to append-only files on ZFS.

PR: kern/149495 [1], kern/151082 [2]
Submitted by: Daniel Zhelev <daniel@zhelev.biz> [1], Michael Naef <cal@linu.gs> [2]
Approved by: delphij (mentor)
MFC after: 1 week


213528 07-Oct-2010 avg

opensolaris_kmem kmem_size(): report lesser of vm_kmem_size and available
physical memory

This is needed to correctly autotune ZFS ARC size when vm_kmem_size is
set to value larger than available physical memory.

MFC after: 2 weeks


213198 27-Sep-2010 mm

Properly handle IO with B_FAILFAST
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted

OpenSolaris revision and Bug IDs:

9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843014)
MFC after: 3 weeks


213197 27-Sep-2010 mm

Enable offlining of log devices.

OpenSolaris revision and Bug IDs:

9701:cc5b64682e64
6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after: 3 weeks


212951 21-Sep-2010 avg

zfs_map_page/zfs_unmap_page: do not use sched_pin() and SFB_CPUPRIVATE

zfs_map_page/zfs_unmap_page are mostly called around potential I/O paths
and it seems to be a not very good idea to do cpu pinning there.

Suggested by: kib
MFC after: 2 weeks


212950 21-Sep-2010 avg

zfs_vnops: use zfs_map_page/zfs_unmap_page helper functions in another place

MFC after: 2 weeks


212783 17-Sep-2010 avg

zfs arc_reclaim_needed: fix typo in mismerge in r212780

PR: kern/146410, kern/138790
MFC after: 3 weeks
X-MFC with: r212780


212782 17-Sep-2010 avg

zfs+sendfile: advance uio_offset upon reading as well

Picked from analogous code in tmpfs.

MFC after: 1 week


212781 17-Sep-2010 avg

zfs arc_reclaim_needed: remove redundant checks for arc_c_max and arc_c_max

Those checks are not present in upstream code and they are enforced in
actual calculations of delta by which ARC size can be grown or should be
reduced.

MFC after: 3 weeks


212780 17-Sep-2010 avg

zfs arc_reclaim_needed: more reasonable threshold for available pages

vm_paging_target() is not a trigger of any kind for pageademon, but
rather a "soft" target for it when it's already triggered.
Thus, trying to keep 2048 pages above that level at the expense of ARC
was simply driving ARC size into the ground even with normal memory
loads.
Instead, use a threshold at which a pagedaemon scan is triggered, so
that ARC reclaiming helps with pagedaemon's task, but the latter still
recycles active and inactive pages.

PR: kern/146410, kern/138790
MFC after: 3 weeks


212694 15-Sep-2010 mm

Fix kernel panic when moving a file to .zfs/shares
Fix possible loss of correct error return code in ZFS mount

OpenSolaris revisions and Bug IDs:

11824:53128e5db7cf
6863610 ZFS mount can lose correct error return

12079:13822b941977
6939941 problem with moving files in zfs (142901-12)

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days


212657 15-Sep-2010 avg

zfs vn_has_cached_data: take into account v_object->cache != NULL

This mirrors code in tmpfs.
This changge shouldn't affect much read path, it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages. I believe this situation to be non-essential.

In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.

This change allows us to re-enable vn_has_cached_data() check in zfs_write.

NOTE: strictly speaking resident_page_count and cache fields of v_object
should be exmined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.

Discussed with: alc, kib
Approved by: pjd
Tested with: tools/regression/fsx
MFC after: 3 weeks


212655 15-Sep-2010 avg

zfs mappedread, update_pages: use int for offset and length within a page

uint64_t, int64_t were redundant there

Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212654 15-Sep-2010 avg

zfs mappedread: use uiomove_fromphys where possible

Reviewed by: alc
Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212652 15-Sep-2010 avg

zfs: catch up with vm_page_sleep_if_busy changes

Reviewed by: alc
Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212650 15-Sep-2010 avg

tmpfs, zfs + sendfile: mark page bits as valid after populating it with data

Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).

PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks


212611 14-Sep-2010 mm

Remove duplicated VFS_HOLD due to a mismerge.

PR: kern/150544
Approved by: delphij (mentor)
MFC after: 1 day


212605 14-Sep-2010 mm

Add missing vop_vector zfsctl_ops_shares
Add missing locks around VOP_READDIR and VOP_GETATTR with z_shares_dir

PR: kern/150544
Approved by: delphij (mentor)
Obtained from: perforce (pjd)
MFC after: 1 day


212573 13-Sep-2010 pjd

Remove the page queues lock around vm_page_undirty() - it is no longer needed.

Reviewed by: alc


212494 12-Sep-2010 rpaulo

Revamp locking a bit. This fixes three problems:
* processes now can't go away while we are inserting probes (fixes a panic)
* if a trap happens, we won't be holding the process lock (fixes a hang)
* fix a LOR between the process lock and the fasttrap bucket list lock

Thanks to kib for pointing some problems.
Sponsored by: The FreeBSD Foundation


212465 11-Sep-2010 rpaulo

Avoid a LOR (sleepable after non-sleepable) in
fasttrap_tracepoint_enable().

Sponsored by: The FreeBSD Foundation


212425 10-Sep-2010 mdf

Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.


212407 10-Sep-2010 pjd

Forgot to commit this file. Add ZPOOL_CONFIG_IS_LOG.

Reported by: keramida
MFC after: 2 weeks


212385 09-Sep-2010 pjd

On FreeBSD we can log from pool that have multiple top-level vdevs or log
vdevs, so don't deny adding new vdevs if bootfs property is set.

MFC after: 2 weeks


212357 09-Sep-2010 rpaulo

Fix two bugs in DTrace:
* when the process exits, remove the associated USDT probes
* when the process forks, duplicate the USDT probes.

Sponsored by: The FreeBSD Foundation


212160 02-Sep-2010 gibbs

Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.

o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().

o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.

o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to it's next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.

o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.

sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c
Use an ordered tag when issuing a synchronize cache command.

Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by: Spectra Logic Corporation
MFC after: 1 month


212093 01-Sep-2010 rpaulo

Make the /dev/dtrace/helper node have the mode 0660. This allows
programs that refuse to run as root (pgsql) to install probes when their
user is part of the wheel group.

Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M dev/dtrace/dtrace_load.c


212002 30-Aug-2010 jh

execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)


211948 28-Aug-2010 pjd

Return NULL pointer instead of B_FALSE as it is done in the vendor code.

Obtained from: //depot/user/pjd/zfs/...


211947 28-Aug-2010 pjd

Move ZUT_OBJS in the same place that is used in vendor code.

Obtained from: //depot/user/pjd/zfs/...


211932 28-Aug-2010 mm

Import changes from OpenSolaris that provide
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes

Detailed information (OpenSolaris onnv changesets and Bug IDs):

9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer

9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership

9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters

10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission

10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir

10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)

10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL

10295:f7a18a1e9610
6870564 panic in zfs_getsecattr

Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks


211931 28-Aug-2010 mm

Update ZFS metaslab code from OpenSolaris.
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.

Detailed information (OpenSolaris onnv changesets and Bug IDs):

11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently

11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics

12047:7c1fcc8419ca
6917066 zfs block picking can be improved

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks


211929 28-Aug-2010 rpaulo

Remove debugging.

Sponsored by: The FreeBSD Foundation


211925 28-Aug-2010 rpaulo

Replace a memory barrier with a mutex barrier.

Sponsored by: The FreeBSD Foundation


211900 27-Aug-2010 pjd

Use ZFS_CTLDIR_NAME instead of hardcoding ".zfs".


211855 26-Aug-2010 pjd

Update comment now that I finally committed r211854.

MFC after: 1 month


211762 24-Aug-2010 avg

zfs arc_reclaim_thread: no need to call arc_reclaim_needed when
resetting needfree

needfree is checked at the very start of arc_reclaim_needed.
This change makes code easier to follow and maintain in face of
potential changed in arc_reclaim_needed.

Also, put the whole sub-block under _KERNEL because needfree can be set
only in kernel code.

To do: rename needfree to something else to aovid confusion with
OpenSolaris global variable of the same name which is used in the same
code, but has different meaning (page deficit).

Note: I have an impression that locking around accesses to this variable
as well as mutual notifications between arc_reclaim_thread and
arc_lowmem are not proper.

MFC after: 1 week


211745 24-Aug-2010 rpaulo

Replace a pksignal() call with tdksignal().

Pointed out by: kib


211744 24-Aug-2010 rpaulo

MD fasttrap implementation.

Sponsored by: The FreeBSD Foundation


211738 24-Aug-2010 rpaulo

Port the fasttrap provider to FreeBSD. This provider is responsible for
injecting debugging probes in the userland programs and is the basis for
the pid provider and the usdt provider.

Sponsored by: The FreeBSD Foundation


211618 22-Aug-2010 rpaulo

Port this to FreeBSD. We miss some suword functions, so we use copyout.

Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M sys/fasttrap_impl.h


211611 22-Aug-2010 rpaulo

Destroy the helper device when unloading.

Sponsored by: The FreeBSD Foundation


211610 22-Aug-2010 rpaulo

Add more compatibility structure members needed by the upcoming fasttrap
DTrace device.

Sponsored by: The FreeBSD Foundation


211608 22-Aug-2010 rpaulo

Kernel DTrace support for:
o uregs (sson@)
o ustack (sson@)
o /dev/dtrace/helper device (needed for USDT probes)

The work done by me was:
Sponsored by: The FreeBSD Foundation


211607 22-Aug-2010 rpaulo

Add a function compatibility function dtrace_instr_size_isa() that on
FreeBSD does the same as dtrace_dis_isize().

Sponsored by: The FreeBSD Foundation


211606 22-Aug-2010 rpaulo

Add the FreeBSD definition for the fasttrap ioctls.

Sponsored by: The FreeBSD Foundation


211566 21-Aug-2010 rpaulo

Add a sysname char * to struct opensolaris_utsname.

Sponsored by: The FreeBSD Foundation


211555 21-Aug-2010 rpaulo

Port the DTrace helper ioctls to FreeBSD and add a helper member to
dof_helper_t (needed by drti.o).

Sponsored by: The FreeBSD Foundation


211553 21-Aug-2010 rpaulo

Add sysname to struct opensolaris_utsname. This is needed by one DTrace
test.

Sponsored by: The FreeBSD Foundation


211484 19-Aug-2010 imp

First cut at mips n64 ABI support


210999 07-Aug-2010 pjd

In FreeBSD we use 'jailed' property.

MFC after: 2 weeks


210470 25-Jul-2010 mm

Import two changesets from OpenSolaris to make future updates easier.

The changes do not affect FreeBSD code because
zfs_znode_move(), cleanlocks() and cleanshares() are not used.

OpenSolaris onnv changeset: 9788:f660bc44f2e8, 9909:aa280f585a3e

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843700, 6790232)
MFC after: 7 weeks


210457 24-Jul-2010 mm

Consider snapshots as descendants via zfs allow -d

OpenSolaris onnv changeset: 9847:2f3ba86e857a

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6809340)
MFC after: 1 week


210427 23-Jul-2010 avg

zfs arc_memory_throttle: available memory is free + cache

OpenSolaris freemem has the same meaning as our v_free_count +
v_cache_count.

Obtained from: Artem Belevich <fbsdlist@src.cx>,
Peter Jeremy <peterjeremy@acm.org>
Discussed with: pjd
MFC after: 2 weeks


210398 22-Jul-2010 mm

Enable fake resolving of SMB RIDs by using nulldomain and UID_NOBODY
- fixes panics when Solaris/OpenSolaris pools that contain files
uploaded with the SMB protocol are accessed

Enable seting/unsetting the sharesmb property (dummy action)
- allows users who import pools from Solaris/Opensolaris to unset
the sharesmb property and get rid of annoying messages

PR: kern/145778, kern/148709
Approved by: pjd, delphij (mentor)
MFC after: 7 weeks


210282 20-Jul-2010 mm

To improve latency, lower default vfs.zfs.vdev.max_pending from 35 to 10

OpenSolaris onnv changeset (partial): 10801:e0bf032e8673

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6891731)
MFC after: 1 week


210193 17-Jul-2010 nwhitehorn

Add OpenSolaris atomics for powerpc64 and connect ZFS to the build on
this platform.

Reviewed by: pjd


210192 17-Jul-2010 nwhitehorn

Increase stack size for ZFS sync thread. This is required to make ZFS
function on 64-bit PowerPC.

Reviewed by: pjd
Obtained from: OpenSolaris changeset 14653:7cf402a7f374


210172 16-Jul-2010 jhb

Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with: attilio


210171 16-Jul-2010 jhb

When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after: 3 days


209962 13-Jul-2010 mm

Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced
in Solaris 10 updates 141445-09 and 142901-14.

Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)

7844:effed23820ae
6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)

7897:e520d8258820
6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)

7965:b795da521357
6740164 zpool attach can create an illegal root pool (141909-02)

8084:b811cc60d650
6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)

8121:7fd09d4ebd9c
6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)

8129:e4f45a0bfbb0
6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)

8188:fd00c0a81e80
6761100 want zdb option to select older uberblocks (141445-01)

8190:6eeea43ced42
6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)

8225:59a9961c2aeb
6737463 panic while trying to write out config file if root pool import fails (141445-01)

8227:f7d7be9b1f56
6765294 Refactor replay (141445-01)

8228:51e9ca9ee3a5
6572357 libzfs should do more to avoid mnttab lookups (141909-01)
6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)

8241:5a60f16123ba
6328632 zpool offline is a bit too conservative (141445-01)
6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698 checksum failures after offline -t / export / import / scrub (141445-01)
6745863 ZFS writes to disk after it has been offlined (141445-01)
6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107 I/O should never suspend during spa_load() (141445-01)
6776548 codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)

8242:e46e4b2f0a03
6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)

8269:03a7e9050cfd
6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194 "zfs get" VALUE column is as wide as NAME (141445-01)
6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518 ASSERT strings shouldn't be pre-processed (141445-01)

8274:846b39508aff
6713916 scrub/resilver needlessly decompress data (141445-01)

8343:655db2375fed
6739553 libzfs_status msgid table is out of sync (141445-01)
6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108 zfs_realloc() should not free original memory on failure (141445-01)

8525:e0e0e525d0f8
6788830 set large value to reservation cause core dump (141445-01)
6791064 want sysevents for ZFS scrub (141445-01)
6791066 need to be able to set cachefile on faulted pools (141445-01)
6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)

8547:bcc7b46e5ff7
6792884 Vista clients cannot access .zfs (141445-01)

8632:36ef517870a3
6798384 It can take a village to raise a zio (141445-01)

8636:7e4ce9158df3
6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)

8692:692d4668b40d
6801507 ZFS read aggregation should not mind the gap (141445-01)

8697:e62d2612c14d
6633095 creating a filesystem with many properties set is slow (141445-01)

8768:dfecfdbb27ed
6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)

8811:f8deccf701cf
6790687 libzfs mnttab caching ignores external changes (141445-01)
6791101 memory leak from libzfs_mnttab_init (141445-01)

8845:91af0d9c0790
6800942 smb_session_create() incorrectly stores IP addresses (N/A)
6582163 Access Control List (ACL) for shares (141445-01)
6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184 Panic at smb_oplock_conflict+0x35() (N/A)

8876:59d2e67b4b65
6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)

8924:5af812f84759
6789318 coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)

9030:243fd360d81f
6815893 hang mounting a dataset after booting into a new boot environment (141445-01)

9056:826e1858a846
6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)

9179:d8fbd96b79b3
6790064 zfs needs to determine uid and gid earlier in create process (141445-01)

9214:8d350e5d04aa
6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)

9229:e3f8b41e5db4
6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)

9230:e4561e3eb1ef
6821169 offlining a device results in checksum errors (141445-01)
6821170 ZFS should not increment error stats for unavailable devices (141445-01)
6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)

9234:bffdc4fc05c4
6792139 recovering from a suspended pool needs some work (141445-01)
6794830 reboot command hangs on a failed zfs pool (141445-01)

9246:67c03c93c071
6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)

9276:a8a7fc849933
6816124 System crash running zpool destroy on broken zpool (141445-03)

9355:09928982c591
6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)

9391:413d0661ef33
6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)

9396:f41cf682d0d3 (part already merged)
6501037 want user/group quotas on ZFS (141445-03)
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)

9404:319573cd93f8
6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)

9412:4aefd8704ce0
6717022 ZFS DMU needs zero-copy support (141445-03)

9425:e7ffacaec3a8
6799895 spa_add_spares() needs to be protected by config lock (141445-03)
6826466 want to post sysevents on hot spare activation (141445-03)
6826468 spa 'allowfaulted' needs some work (141445-03)
6826469 kernel support for storing vdev FRU information (141445-03)
6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471 I/O errors after device remove probe can confuse FMA (141445-03)
6826472 spares should enjoy some of the benefits of cache devices (141445-03)

9443:2a96d8478e95
6833711 gang leaders shouldn't have to be logical (141445-03)

9463:d0bd231c7518
6764124 want zdb to be able to checksum metadata blocks only (141445-03)

9465:8372081b8019
6830237 zfs panic in zfs_groupmember() (141445-03)

9466:1fdfd1fed9c4
6833162 phantom log device in zpool status (141445-03)

9469:4f68f041ddcd
6824968 add ZFS userquota support to rquotad (141445-03)

9470:6d827468d7b5
6834217 godfather I/O should reexecute (141445-03)

9480:fcff33da767f
6596237 Stop looking and start ganging (141909-02)

9493:9933d599bc93
6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)

9512:64cafcbcc337
6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)

9515:d3b739d9d043
6586537 async zio taskqs can block out userland commands (142901-09)

9554:787363635b6a
6836768 zfs_userspace() callback has no way to indicate failure (N/A)

9574:1eb6a6ab2c57
6838062 zfs panics when an error is encountered in space_map_load() (141909-02)

9583:b0696cd037cc
6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)

9630:e25a03f552e0
6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)

9653:a70048a304d1
6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)

9688:127be1845343
6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069 zfs get userused@S-1-... doesn't work (N/A)

9873:8ddc892eca6e
6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)

9904:d260bd3fd47c
6838344 kernel heap corruption detected on zil while stress testing (141445-06)

9951:a4895b3dd543
6844900 zfs_ioc_userspace_upgrade leaks (N/A)

10040:38b25aeeaf7a
6857012 zfs panics on zpool import (141445-06)

10000:241a51d8720c
6848242 zdb -e no longer works as expected (N/A)

10100:4a6965f6bef8
6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)

10160:a45b03783d44
6861983 zfs should use new name <-> SID interfaces (N/A)
6862984 userquota commands can hang (141445-06)

10299:80845694147f
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)

10302:a9e3d1987706
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)

10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)

10800:469478b180d9
6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)

10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)

10810:b6b161a6ae4a
6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)

10890:499786962772
6807339 spurious checksum errors when replacing a vdev (142901-13)

11249:6c30f7dfc97b
6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)

11454:6e69bacc1a5a
6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)

11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)

Discussed with: pjd
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 months


209721 06-Jul-2010 rpaulo

Merge from vendor-sys/opensolaris:
* add fasttrap files


209275 17-Jun-2010 mm

Import latest ARC change from OpenSolaris:
- large ghost eviction causes high write latency
- arc_adjust might adjust MRU unnecessarily
- arc_adapt can lead to wild arc_p adjustment

OpenSolaris onnv-revision: 12636:13b5d698941e

Submitted by: avg
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6950219, 6953403, 6951024)
MFC after: 1 month


209261 17-Jun-2010 pjd

Turn off UMA allocations on all archs by default. It isn't stable even on
amd64.

Reported by: many
MFC after: 3 days


209230 16-Jun-2010 pjd

Remove redundant assignment.

MFC after: 3 days


209101 12-Jun-2010 mm

Fix arc_read_done may try to byteswap undefined data (sparc related)

OpenSolaris onnv-revision: 10839:cf83b553a2ab

Obtained from: OpenSolaris (Bug ID 6836714)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209100 12-Jun-2010 mm

Fix panic in zfs_getsecattr

OpenSolaris onnv-revision: 10295:f7a18a1e9610

Obtained from: OpenSolaris (Bug ID 6870564)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209099 12-Jun-2010 mm

Fix possible zfs panic on zpool import

OpenSolaris onnv-revision: 10040:38b25aeeaf7a

Obtained from: OpenSolaris (Bug ID 6857012)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209098 12-Jun-2010 mm

Fix zpool resilver stalls with spa_scrub_thread in a 3 way deadlock

OpenSolaris onnv-revision: 9997:174d75a29a1c

Obtained from: OpenSolaris (Bug ID 6843235)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209097 12-Jun-2010 mm

Fix ZFS panic deadlock: cycle in blocking chain via zfs_zget

OpenSolaris onnv-revision: 9774:0bb234ab2287

Obtained from: OpenSolaris (Bug ID 6788152)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209096 12-Jun-2010 mm

Fix vdev_probe() starvation brings txg train to a screeching halt

OpenSolaris onnv-revision: 9722:e3866bad4e96

Obtained from: OpenSolaris (Bug ID 6844069)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209095 12-Jun-2010 mm

Fix incomplete resilvering after disk replacement (raidz)

OpenSolaris onnv-revision: 9434:3bebded7c76a

Obtained from: OpenSolaris (Bug ID 6794570)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209094 12-Jun-2010 mm

Fix zfs destroy fails to free object in open context, stops up txg train

OpenSolaris onnv-revision: 9409:9dc3f17354ed

Obtained from: OpenSolaris (Bug ID 6809683)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209093 12-Jun-2010 mm

Fix unable to remove a file over NFS after hitting refquota limit

OpenSolaris onnv-revision: 8890:8c2bd5f17bf2

Obtained from: OpenSolaris (Bug ID 6798878)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().


208775 03-Jun-2010 mm

Fix freeing space after deleting large files with holes.

OpenSolaris onnv revision: 9950:78fc41aa9bc5

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6792701)
MFC after: 3 days


208689 01-Jun-2010 mm

Fix ZIL close when doing zfs rollback or zfs receive on a mounted dataset.

The fix is a partial import and merge of OpenSolaris onnv revisions
8227:f7d7be9b1f56. and 9292:e112194b5b73

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6798298)
MFC after: 3 days


208683 31-May-2010 pjd

Fix a bug where resilver is not started automatically on pool import or load.
If disk was missing on pool load or import and on next pool load or import
it was present, resilver wasn't started automatically and ZFS reported all disks
as ONLINE and healthy. Then, when another disk died, pool became unaccessible,
because if it was 2-way mirror or RAIDZ1 two vdevs were out of sync.

To fix the problem, start resilver automatically on pool load or import.

Obtained from: OpenSolaris
MFC after: 3 days


208682 31-May-2010 pjd

Fix panic when reading label from provider with non power of 2 sector size.

Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
MFC after: 3 days


208474 23-May-2010 mm

Remove kstat.zfs.arcstats.l2_write_bytes_written

The arcstats.l2_write_bytes_written kstat counter introduced
in r205231 was duplicite with vendor's arcstats.l2_write_bytes counter
imported in r208373 (OpenSolaris revision 8582:df9361868dbe)

Approved by: pjd, delphij (mentor)
MFC after: 3 days


208472 23-May-2010 mm

Fix zfs receive temporarily changing unchanged stream properties.
Fix possible panic with zfs_enable_datasets.

OpenSolaris onnv revision: 8536:33bd5de3260e

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6748561, 6757075)
MFC after: 3 days


208458 23-May-2010 pjd

Create UMA zones unconditionally.

MFC after: 3 days


208454 23-May-2010 pjd

Remove ZIO_USE_UMA from arc.c as well.

MFC after: 3 days


208453 23-May-2010 kib

Reorganize syscall entry and leave handling.

Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month


208443 23-May-2010 mm

Fix kernel panic when calling spa_tryimport() on a corrupted pool.

OpenSolaris onnv revision: 8680:005fe27123ba

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6786321)
MFC after: 1 day


208442 23-May-2010 mm

Fix mutex_exit misorder that can cause a kernel panic.

OpenSolaris onnv revision: 8667:5c308a17eb7c

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6795440)
MFC after: 1 day


208373 21-May-2010 mm

Update L2ARC code and fix several bugs.

- improve ARC memory consumption (Bug ID 6488341)
- ARC/L2ARC metadata accounting (Bug ID 6748019)
- L2ARC turbo warmup (Bud ID 6748023)
- kstats for ARC content (Bug ID 6748023)
- kstats for evicted bytes from ARC by L2ARC state (Bud ID 6871680)
- fix panic on i386 systems (Bug ID 6821260)

OpenSolaris onnv revisions:
8582:df9361868dbe, 8628:97dcded6e556, 9215:7c4584f76b47,
9274:a10f8bd993c1, 10357:29060492b29d

OpenSolaris Bug IDs:
6748019, 6748023, 6748030, 6488341, 6798268, 6821260, 6790261, 6871680

Approved by: pjd, delphij (mentor)
Obtained from: OpenSlaris (multiple bug IDs)
MFC after: 3 days


208372 21-May-2010 mm

Reorder some already introduced locking variables.

OpenSolaris onnv revision: 8214:d7abf7c1f1c1

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6747934)
MFC after: 3 days


208371 21-May-2010 mm

Fix stack overflow in zfs send.

OpenSolaris onnv-revision: 8012:8ea30813950f

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6765626)
MFC after: 3 days


208370 21-May-2010 mm

Fix: vdev_reopen() can lead to failed allocations

OpenSolaris onnv-revision: 7980:589f37f25048

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6764914)
MFC after: 3 days


208166 16-May-2010 pjd

Fix userland build by making io_task available only for the kernel and by
providing taskq_dispatch_safe() macro.

MFC after: 1 week


208148 16-May-2010 pjd

Allow to configure UMA usage for ZIO data via loader and turn it on by
default for amd64. On i386 I saw performance degradation when UMA was used,
but for amd64 it should help.

MFC after: 3 days


208147 16-May-2010 pjd

Add task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.

Prodded by: kib
MFC after: 1 week


208142 16-May-2010 pjd

The whole point of having dedicated worker thread for each leaf VDEV was to
avoid calling zio_interrupt() from geom_up thread context. It turns out that
when provider is forcibly removed from the system and we kill worker thread
there can still be some ZIOs pending. To complete pending ZIOs when there is
no worker thread anymore we still have to call zio_interrupt() from geom_up
context. To avoid this race just remove use of worker threads altogether.
This should be more or less fine, because I also thought that zio_interrupt()
does more work, but it only makes small UMA allocation with M_WAITOK.
It also saves one context switch per I/O request.

PR: kern/145339
Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com>
MFC after: 1 week


208131 16-May-2010 mm

Fix deadlock between zfs_dirent_lock and zfs_rmdir

OpenSolaris onnv revision: 11321:506b7043a14c

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6847615)
MFC after: 3 days


208130 16-May-2010 mm

Fix perfomance problem with ZFS prefetch caching [1]
Add statistics for ZFS prefetch (sysctl kstat.zfs.misc.zfetchstats)

Partial import of OpenSolaris onnv revision 10474:0e96dd3b905a

Reported by: jhell@dataix.net (private e-mail) [1]
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6859997, 6868951)
MFC after: 3 days


208050 13-May-2010 mm

Fix ZIL-related panic on zfs rollback.

OpenSolaris onnv-revision: 8746:e1d96ca6808c

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6796377)
MCF after: 1 week


208047 13-May-2010 mm

Import OpenSolaris revision 7837:001de5627df3
It includes the following changes:
- parallel reads in traversal code (Bug ID 6333409)
- faster traversal for zfs send (Bug ID 6418042)
- traversal code cleanup (Bug ID 6725675)
- fix for two scrub related bugs (Bug ID 6729696, 6730101)
- fix assertion in dbuf_verify (Bug ID 6752226)
- fix panic during zfs send with i/o errors (Bug ID 6577985)
- replace P2CROSS with P2BOUNDARY (Bug ID 6725680)

List of OpenSolaris Bug IDs:
6333409, 6418042, 6757112, 6725668, 6725675, 6725680,
6725698, 6729696, 6730101, 6752226, 6577985, 6755042

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 1 week


208030 13-May-2010 trasz

Add missing check to prevent local users from panicing the kernel by trying
to set malformed ACL.

MFC after: 3 days


207956 12-May-2010 mm

Fix possible hang when replaying large truncations.

OpenSolaris onnv revision: 7904:6a124a4ca9c5

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6761624)
MFC after: 3 days


207937 11-May-2010 pjd

I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

MFC after: 3 days


207936 11-May-2010 pjd

Eventhough r203504 eliminates taste traffic provoked by vdev_geom.c,
ZFS still like to open all vdevs, close them and open them again,
which in turn provokes taste traffic anyway.

I don't know of any clean way to fix it, so do it the hard way - if we can't
open provider for writing just retry 5 times with 0.5 pauses. This should
elimitate accidental races caused by other classes tasting providers created on
top of our vdevs.

MFC after: 3 days
Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
Reported by: Yuri Pankov <yuri.pankov@gmail.com>


207934 11-May-2010 pjd

Add missing new line characters to the warnings.

MFC after: 3 days


207911 11-May-2010 mm

Fix failed assertion on destroying datasets from an older pool version.

OpenSolaris onnv revision: 9390:887948510f80

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826861)
MFC after: 3 days


207910 11-May-2010 mm

Fix possible panic with zfs destroy.

OpenSolaris onnv revision: 8779:f164e0e90508

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6784924)
MFC after: 3 days


207909 11-May-2010 mm

Fix zfs rename (may occasionally fail with dataset busy).

OpenSolaris onnv revision: 8517:41a0783dde17

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6784757)
MFC after: 3 days


207908 11-May-2010 mm

Fix endianess bug in ZFS intent log (ZIL).

OpenSolaris onnv revision: 8109:6147a1bdd359

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6760048)
MFC after: 3 days


207745 07-May-2010 trasz

Enforce RLIMIT_FSIZE in ZFS.

Reviewed by: pjd@


207736 07-May-2010 mckusick

Merger of the quota64 project into head.

This joint work of Dag-Erling Smørgrav and myself updates the
FFS quota system to support both traditional 32-bit and new 64-bit
quotas (for those of you who want to put 2+Tb quotas on your users).

By default quotas are not compiled into the kernel. To include them
in your kernel configuration you need to specify:

options QUOTA # Enable FFS quotas

If you are already running with the current 32-bit quotas, they
should continue to work just as they have in the past. If you
wish to convert to using 64-bit quotas, use `quotacheck -c 64';
if you wish to revert from 64-bit quotas back to 32-bit quotas,
use `quotacheck -c 32'.

There is a new library of functions to simplify the use of the
quota system, do `man quotafile' for details. If your application
is currently using the quotactl(2), it is highly recommended that
you convert your application to use the quotafile interface.
Note that existing binaries will continue to work.

Special thanks to John Kozubik of rsync.net for getting me
interested in pursuing 64-bit quota support and for funding
part of my development time on this project.


207683 05-May-2010 marius

- Fix broken symlinks on cross platform zfs send/recv. [1]
- Enable zfs_ace_byteswap() on FreeBSD as it works just fine (tested between
amd64 and sparc64 in both directions by Michael Moll).

PR: 146272
Approved by: mm, pjd
Obtained from: OpenSolaris (onnv rev. 8283:1ca59f393041; Bug ID 6764193) [1]
MFC after: 3 days


207670 05-May-2010 mm

Introduce hardforce export option (-F) for "zpool export".
When exporting with this flag, zpool.cache remains untouched.

OpenSolaris onnv revision: 8211:32722be6ad3b

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID: 6775357)


207626 04-May-2010 mm

Speed up ZFS list operation with objset prefetching.

Partial import of OpenSolaris onnv revisions:
8415:8809e849f63e, 10474:0e96dd3b905a

PR: kern/146297
Submitted by: myself
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6386929, 6755389, 6847118)
MFC after: 2 weeks


207624 04-May-2010 mm

Fix deadlock during zfs receive.

OpenSolaris onnv revision: 9299:8809e849f63e

PR: kern/146296
Submitted by: myself
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6783818, 6826836)
MFC after: 1 week


207481 01-May-2010 mm

Add sysctl and loader tunable vfs.zfs.txg.write_limit_override.
This tunable improves fine-tuning of ZFS write throttling.

PR: kern/146108
Suggested by: Nikolay Denev <ndenev at gmail.com>
Approved by: pjd, delphij (mentor)
MFC after: 2 weeks


207480 01-May-2010 mm

Change description of tunable group vfs.zfs.txg to be more
understandable.

Approved by: pjd, delphij (mentor)
MFC after: 3 days


207427 30-Apr-2010 mm

Fix improper pool write throughput calculation.

OpenSolaris onnv revision: 9366:17553395a745

PR: kern/146108
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris, Bug ID 6817339
MFC after: 2 weeks


207334 28-Apr-2010 pjd

Backport fix for 'zfs_znode_dmu_init: existing znode for dbuf' panic from OpenSolaris.

PR: kern/144402
Reported by: Alex Bakhtin <alex.bakhtin@gmail.com>
Tested by: Alex Bakhtin <alex.bakhtin@gmail.com>
Obtained from: OpenSolaris, Bug ID 6895088
MFC after: 3 days


207068 22-Apr-2010 pjd

Allow to modify directory's content even if the ZFS_NOUNLINK (SF_NOUNLINK,
sunlnk) flag is set. We only deny dirctory's removal or rename.

PR: kern/143343
Reported by: marck
MFC after: 3 days


206901 20-Apr-2010 rpaulo

Rename the cyclic global variable lapic_cyclic_clock_func to just
cyclic_clock_func. This will make more sense when we start developing non
x86 cyclic version.


206900 20-Apr-2010 rpaulo

The amd64 version of the cyclic dtrace module is a verbatim copy of the
i386 version, so instead having a copy of the same file, use Makefile
foo to include the i386 version on amd64.


206838 19-Apr-2010 delphij

Partially MFp4 #176265 by pjd@:

- Properly initialize and destroy system_taskq.
- Add a dummy implementation of taskq_create_proc().

Note: We do not currently use system_taskq in ZFS so this is mostly a
no-op at this time. Proper system_taskq initialization is required
by newer ZFS code.

Ok'ed by: pjd
MFC after: 2 weeks


206797 18-Apr-2010 pjd

Restore previous order.


206796 18-Apr-2010 pjd

Style fixes.


206795 18-Apr-2010 pjd

Add missing list and lock destruction.


206794 18-Apr-2010 pjd

Extend locks scope to match OpenSolaris.


206793 18-Apr-2010 pjd

Remove racy assertion.

Obtained from: OpenSolaris


206792 18-Apr-2010 pjd

Set ARC_L2_WRITING on L2ARC header creation.

Obtained from: OpenSolaris


206667 15-Apr-2010 pjd

Fix 3-way deadlock that can happen because of ZFS and vnode lock
order reversal.

thread0 (vfs_fhtovp) thread1 (vop_getattr) thread2 (zfs_recv)
-------------------- --------------------- ------------------
vn_lock
rrw_enter_read
rrw_enter_write (hangs)
rrw_enter_read (hangs)
vn_lock (hangs)

Submitted by: Attila Nagy <bra@fsn.hu>
MFC after: 3 days


205346 19-Mar-2010 pjd

The same code is used to import and to create pool.
The order of operations is the following:
1. Try to open vdev by remembered path and guid.
2. If 1 failed, try to find vdev which guid matches and ignore the path.
3. If 2 failed this means either that the vdev we're looking for is gone
or that pool is being created and vdev doesn't contain proper guid yet.
To be able to handle pool creation we open vdev by path anyway.

Because of 3 it is possible that we open wrong vdev on import which can lead to
confusions.

The solution for this is to check spa_load_state. On pool creation it will be
equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it
is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails,
we open by guid. We no longer open wrong vdev on import.

MFC after: 2 weeks


205264 17-Mar-2010 kmacy

- cache line align arcs_lock array (h/t Marius Nuennerich)
- fix ARCS_LOCK_PAD to use architecture defined CACHE_LINE_SIZE
- cache line align buf_hash_table ht_locks array

MFC after: 7 days


205253 17-Mar-2010 kmacy

use CACHE_LINE_SIZE instead of hardcoding 128 for lock pad

pointed out by Marius Nuennerich and jhb@


205231 16-Mar-2010 kmacy

- reduce contention by breaking up ARC state locks in to 16 for data
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)


205133 13-Mar-2010 kmacy

fix compilation under ZIO_USE_UMA


205132 13-Mar-2010 kmacy

Don't bottleneck on acquiring the stream locks - this avoids a massive
drop off in throughput with large numbers of simultaneous reads

MFC after: 7 days


205079 12-Mar-2010 pjd

Remove bogus assertion.

Reported by: Johan Ström <johan@stromnet.se>
Obtained from: OpenSolaris, Bug ID 6827260
MFC after: 1 week


204804 06-Mar-2010 pjd

Remove racy assertion.

Reported by: Attila Nagy <bra@fsn.hu>
Obtained from: OpenSolaris, Bug ID 6827260
MFC after: 1 week


204185 22-Feb-2010 marcel

Use mf and not mf.a. The latter doesn't force memory ordering and
applies to sequential memory.


204101 19-Feb-2010 pjd

Don't set f_bsize to recordsize. It might confuse some software (like squid).

Submitted by: Alexander Zagrebin <alexz@visp.ru>
MFC after: 2 weeks


204073 18-Feb-2010 pjd

Add tunable and sysctl to skip hostid check on pool import.


203533 05-Feb-2010 delphij

Remove two files that are not needed by FreeBSD.

Approved by: pjd
MFC after: 2 weeks


203504 04-Feb-2010 pjd

Open provider for writting when we find the right one. Opening too much
providers for writing provokes huge traffic related to taste events send
by GEOM on close. This can lead to various problems with opening GEOM
providers that are created on top of other GEOM providers.

Reorted by: Kurt Touet <ktouet@gmail.com>, mr
Tested by: mr, Baginski Darren <kickbsd@ya.ru>
MFC after: 2 weeks


202964 25-Jan-2010 delphij

On FreeBSD, time_t is 64-bit for all platforms except i386 and powerpc,
where the type is 32-bit. ZFS can handle 64-bit timestamp internally
but zfs_setattr() would check if the time value can fit, we change the
checking macros to match 64-bit timestamp if the platform supports it.

This change has some downsides like, while you can import zfs on 32-bit
platforms, the timestamp would overflow if they are out of the range.

This fixes the Y2.038K issue on platforms using 64-bit timestamps.

Reviewed by: pjd
MFC after: 1 month


202129 11-Jan-2010 delphij

Report ZFS filesystem version instead of the zpool version when we say it.

Reported by: Yuri Pankov (on -fs@)
Submitted by: delphij
Approved by: pjd
MFC after: 1 week


201756 07-Jan-2010 delphij

Re-apply onnv-gate revisions 7994 and 8986 (corresponds to FreeBSD
revision 200726 and 200727). It looks like that the two revisions
were not applied in the right sequence, I found this when comparing
with the OpenSolaris code.

MFC after: 3 days
Reviewed by: mm@


201689 06-Jan-2010 delphij

Instead of assuming all vdevs are healthy, check the newest vdev label
for each vdev's status. Booting from a degraded vdev should now be
more robust.

Submitted by: Matt Reimer <mattjreimer at gmail.com>
Sponsored by: VPOP Technologies, Inc.
MFC after: 2 weeks


201684 06-Jan-2010 pjd

Teach the (gpt)zfsboot and zfsloader raidz code to use its buffers
more efficiently.

Before this patch, in the worst case memory use would increase
exponentially on the number of drives in the raidz vdev.

Submitted by: Matt Reimer <mattjreimer@gmail.com>
Sponsored by: VPOP Technologies, Inc.
Silence from: dfr


201406 02-Jan-2010 delphij

Reduce diff against OpenSolaris - move Giant acquire/release to
zfs_znode.c. As a side effect this also eliminates two potential
Giant leaks.

Approved by: pjd
MFC after: 1 month


201143 28-Dec-2009 delphij

Apply OpenSolaris revision 8012 which brings our zpool to version 14,
making it possible for zpools created on OpenSolaris 2009.06 be used
on FreeBSD.

PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks


200727 19-Dec-2009 delphij

Apply fix for Solaris bug 6462803: zfs snapshot -r failed because
filesystem was busy
(onnv revision 8989)

Submitted by: mm
Approved by: pjd
Obtained from: OpenSolaris
MFC after: 2 weeks


200726 19-Dec-2009 delphij

Apply fix for Solaris bug 6801979: zfs recv can fail with E2BIG
(onnv revision 8986)

Requested by: mm
Submitted by: pjd
Obtained from: OpenSolaris
MFC after: 2 weeks


200724 19-Dec-2009 delphij

Apply fix Solaris bug 6462803 zfs snapshot -r failed because
filesystem was busy.

Submitted by: mm
Approved by: pjd
MFC after: 2 weeks


200162 05-Dec-2009 kib

Change VOP_FSYNC for zfs vnode from VOP_PANIC to zfs_freebsd_fsync(),
both to not panic when fsync(2) is called for fifo on zfs
filedescriptor, and to actually fsync fifo inode to permanent storage.

PR: kern/141177
Reviewed by: pjd
MFC after: 1 week


200158 05-Dec-2009 pjd

We have to eventually look for provider without checking guid as this is need
for attaching when there is no metadata yet.

Before r200125 the order of looking for providers was wrong. It was:
1. Find provider by name.
2. Find provider by guid.
3. Find provider by name and guid.

Where it should have been:
1. Find provider by name and guid.
2. Find provider by guid.
3. Find provider by name.

MFC after: 1 week


200126 05-Dec-2009 pjd

Fix deadlock when ZVOLs are present and we are replacing dead component or
calling scrub when pool is in a degraded state. It will try to taste ZVOLs,
which will lead to deadlock, as ZVOL will try to acquire the same locks as
replace/scrub is holding already.

We can't simply skip provider based on their GEOM class, because ZVOL can have
providers build on top of it and we need to skip those as well.

We do it by asking for ZFS::iszvol attribute. Any ZVOL-based provider will give
us positive answer and we have to skip those providers.

This way we remove possibility to create ZFS pools on top of ZVOLs, but it is
not very useful anyway.

I believe deadlock is still possible in some very complex situations like when
we have MD provider on top of UFS file on top of ZVOL. When we try to replace
dead component in the pool mentioned ZVOL is based on, there might be a
deadlock when ZFS will try to taste MD provider. There is no easy way to detect
that, but it isn't very common.

MFC after: 1 week


200125 05-Dec-2009 pjd

Always check guid when opening by path, because we may end up with provider
that does have the same name, but only by accident.

MFC after: 1 week


200124 05-Dec-2009 pjd

Avoid using additional variable for storing an error if we are not going
to do anything with it.


199241 13-Nov-2009 ps

Correct another case of not doing 64bit math. This allows mine and
other raidz2 volumes to boot.

Submitted by: Matt Reimer <mattjreimer@gmail.com>


199157 10-Nov-2009 pjd

Be careful which vattr fields are set during setattr replay.
Without this fix strange things can appear after unclean shutdown like
files with mode set to 07777.

Reported by: des
MFC after: 3 days


199156 10-Nov-2009 pjd

Avoid passing invalid mountpoint to getnewvnode().

Reported by: rwatson
Tested by: rwatson
MFC after: 3 days


198703 30-Oct-2009 pjd

- zfs_zaccess() can handle VAPPEND too, so map V_APPEND to VAPPEND and call
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely) call both
zfs_access() and vaccess() after spliting V* flags.

This fixes "dirtying snapshot!" panic.

PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
MFC after: 3 days


198420 23-Oct-2009 rnoland

Correct some issues with zfs boot.

- Teach it to read gang blocks. (essentially untested)
If you see "ZFS: gang block detected!", please let
me know, so we can either remove the printf if it
works, or fix it if it doesn't.

- If multiple partitions exist on a disk, probe them all.
We also need to reset dsk->start to 0 to read the right
sector here.

- With GPT, we can have 128 partitions.

- If the bootfs property has ever been set on a pool
it seems that it never goes away. zpool won't allow
you to add to the pool with the bootfs property set.
However, if you clear the property back to default
we end up getting 0 for the object number and read
a bogus block pointer and fail to boot.

- Fix some error printfs. The printf in the loader is
only capable of c,s and u formats.

- Teach printf how to display %llu

Reviewed by: dfr, jhb
MFC after: 2 weeks


197861 08-Oct-2009 pjd

Allow file system owner to modify system flags if securelevel permits.

MFC after: 3 days


197860 08-Oct-2009 pjd

File system owner is when uid matches and jail matches.

MFC after: 3 days


197843 07-Oct-2009 pjd

On FreeBSD it is enough to report provider removal when orphan event is
received, we don't have to do it on every ENXIO error in I/O path.
Solaris has no GEOM so they have to handle it in a less clean way.

MFC after: 3 days


197842 07-Oct-2009 pjd

Fix white-spaces.

MFC after: 3 days


197831 07-Oct-2009 pjd

Fix situation where Mac OS X NFS client creates a file and when it tries
to set ownership and mode in the same setattr operation, the mode was
overwritten by secpolicy_vnode_setattr().

PR: kern/118320
Submitted by: Mark Thompson <info-gentoo@mark.thompson.bz>
MFC after: 3 days


197816 06-Oct-2009 kmacy

Prevent paging pressure from draining arc too much
- always drain arc if above arc_c_max - never drain arc if arc is below arc_c_max

MFC after: 3 days


197683 01-Oct-2009 delphij

Return EOPNOTSUPP instead of EINVAL when doing chflags(2) over an old
format ZFS, as defined in the manual page.

Submitted by: pjd (response of my original patch but bugs are mine)
MFC after: 3 days


197515 26-Sep-2009 pjd

Handle cases where virtual (GFS) vnodes are referenced when doing forced
unmount. In that case we cannot depend on the proper order of invalidating
vnodes, so we have to free resources when we have a chance.

PR: kern/139062
Reported by: trasz
MFC after: 3 days


197514 26-Sep-2009 pjd

On lookup error VFS expects *vpp to be set to NULL, be sure to do that.

MFC after: 3 days


197513 26-Sep-2009 pjd

Use traverse() function to find and return mount point's vnode instead of
covered vnode when snapshot is already mounted.

MFC after: 3 days


197512 26-Sep-2009 pjd

- Don't depend on value returned by gfs_*_inactive(), it doesn't work
well with forced unmounts when GFS vnodes are referenced.
- Make other preparations to GFS for forced unmounts.

PR: kern/139062
Reported by: trasz
MFC after: 3 days


197497 25-Sep-2009 pjd

Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to
be a bit weak and OpenSolaris also switched to fletcher4.

PR: kern/139072
Reported by: Daniel Grund <bugs@dgrund.de>
MFC after: 3 days


197459 24-Sep-2009 pjd

Before calling vflush(FORCECLOSE) mark file system as unmounted so the
following vnops will fail. This is very important, because without this change
vnode could be reclaimed at any point, even if we increased usecount. The only
way to ensure that vnode won't be reclaimed was to lock it, which would be very
hard to do in ZFS without changing a lot of code. With this change simply
increasing usecount is enough to be sure vnode won't be reclaimed from under
us. To be precise it can still be reclaimed but we won't be able to see it,
because every try to enter ZFS through VFS will result in EIO.

The only function that cannot return EIO, because it is needed for vflush() is
zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.

MFC after: 3 days


197458 24-Sep-2009 pjd

Close race in zfs_zget(). We have to increase usecount first and then
check for VI_DOOMED flag. Before this change vnode could be reclaimed
between checking for the flag and increasing usecount.

MFC after: 3 days


197435 23-Sep-2009 trasz

In VOP_SETACL(9) and VOP_GETACL(9), specifying wrong ACL type should result
in EINVAL, not EOPNOTSUPP.


197426 23-Sep-2009 pjd

Restore BSD behaviour - when creating new directory entry use parent directory
gid to set group ownership and not process gid.

This was overlooked during v6 -> v13 switch.

PR: kern/139076
Reported by: Sean Winn <sean@gothic.net.au>
MFC after: 3 days


197351 20-Sep-2009 pjd

Purge namecache in the same place OpenSolaris does.


197289 17-Sep-2009 pjd

Purge file system namecache when receiving incremental stream and rolling back
to it.

MFC after: 3 days


197287 17-Sep-2009 pjd

Purge namecache for the file system being rolled back, so it doesn't point at
invalid vnodes after the rollback resulting in EIO errors when trying to access
files which are in the namecache.

Reported by: des
MFC after: 3 days


197219 15-Sep-2009 pjd

Forced unmounts work just fine in my tests under heavy load. There might
still be a problem, but it isn't worth a warning.


197218 15-Sep-2009 pjd

We believe ZFS is ready for production use. Remove a warning about it being
experimental. :)


197201 14-Sep-2009 pjd

- Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
ZFS route of not listing snapshots by default with 'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.

Reviewed by: kib
MFC after: 3 days


197177 13-Sep-2009 pjd

Support both case: when snapshot is already mounted and when it is not yet
mounted.

MFC after: 3 days


197172 13-Sep-2009 pjd

Add missing \n.

Reported by: marck


197167 13-Sep-2009 pjd

Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.

Note that converting inode number to snapshot's vnode is expensive operation.
Snapshots are stored in AVL tree, but based on their names, not inode numbers,
so to convert inode to snapshot vnode we have to interate over all snalshots.

This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.

PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 3 days


197153 13-Sep-2009 pjd

When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.

MFC after: 3 days


197152 13-Sep-2009 pjd

Extend scope of the z_teardown_lock lock for consistency and "just in case".

MFC after: 3 days


197151 13-Sep-2009 pjd

Be sure not to overflow struct fid.

MFC after: 3 days


197150 13-Sep-2009 pjd

There is a bug where mze_insert() can trigger an assert() of inserting
the same entry twice. This bug is not fixed yet, but leads to situation
where when try to access corrupted directory the kernel will panic.
Until the bug is properly fixed, try to recover from it and log that it
happened.

Reported by: marck
OpenSolaris bug: 6709336
MFC after: 3 days


197133 12-Sep-2009 pjd

- Protect reclaim with z_teardown_inactive_lock.
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
z_dbuf field is NULL - this might happen in case of rollback or forced
unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
done asynchronously via zfs_reclaim_complete().

MFC after: 1 week


197131 12-Sep-2009 pjd

Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
NULL, but also can point to dead vnode, take that into account.

PR: kern/132068
Reported by: Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 1 week


196985 08-Sep-2009 pjd

Only log successful commands! Without this fix we log even unsuccessful
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.

Reported by: Denis Ahrens <denis@h3q.com>
MFC after: 3 days


196982 08-Sep-2009 pjd

We don't export individual snapshots, so mnt_export field in snapshot's
mount point is NULL. That's why when we try to access snapshots over NFS
use mnt_export field from the parent file system.

MFC after: 1 week


196980 08-Sep-2009 pjd

When we automatically mount snapshot we want to return vnode of the mount point
from the lookup and not covered vnode. This is one of the fixes for using .zfs/
over NFS.

MFC after: 1 week


196979 08-Sep-2009 pjd

On FreeBSD we don't have to look for snapshot's mount point,
because fhtovp method is already called with proper mount point.

MFC after: 1 week


196978 08-Sep-2009 pjd

Call ZFS_EXIT() after locking the vnode.

MFC after: 1 week


196966 08-Sep-2009 kib

Lock Giant around vn_open_cred().
Remove innocent unnecessary call to NDFREE().

Reported by: marcel
Reviewed and tested by: pjd
MFC after: 3 days


196965 08-Sep-2009 pjd

Fix reference count leak for a case where snapshot's mount point is updated.
Such situation is not supported.

This problem was triggered by something like this:

# zpool create tank da0
# zfs snapshot tank@snap
# cd /tank/.zfs/snapshot/snap (this will mount the snapshot)
# cd
# mount -u nosuid /tank/.zfs/snapshot/snap (refcount leak)
# zpool export tank
cannot export 'tank': pool is busy

MFC after: 1 week


196954 07-Sep-2009 pjd

If we have to use avl_find(), optimize a bit and use avl_insert() instead of
avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).

Fix similar case in the code that is currently commented out.


196953 07-Sep-2009 pjd

When snapshot mount point is busy (for example we are still in it)
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.

This fixes a panic reproducable in the following steps:

# zfs create tank/foo
# zfs snapshot tank/foo@snap
# cd /tank/foo/.zfs/snapshot/snap
# umount /tank/foo
panic: avl_find() succeeded inside avl_add()

Reported by: trasz
MFC after: 3 days


196949 07-Sep-2009 trasz

Enable NFSv4 ACL support in ZFS.

Reviewed by: pjd


196947 07-Sep-2009 pjd

Defer thread start until we set priority.

Reviewed by: kib
MFC after: 3 days


196944 07-Sep-2009 pjd

Don't recheck ownership on update mount. This will eliminate LOR between
vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.

Noticed by: kib
Reviewed by: kib
MFC after: 1 week


196943 07-Sep-2009 pjd

- Avoid holding mutex around M_WAITOK allocations.
- Add locking for mnt_opt field.

MFC after: 1 week


196941 07-Sep-2009 trasz

Prevent the line from wrapping.


196927 07-Sep-2009 pjd

Changing provider size is not really supported by GEOM, but doing so when
provider is closed should be ok.

When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.

PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
MFC after: 1 week


196919 07-Sep-2009 pjd

bzero() on-stack argument, so mutex_init() won't misinterpret that the
lock is already initialized if we have some garbage on the stack.

PR: kern/135480
Reported by: Emil Mikulic <emikulic@gmail.com>
MFC after: 3 days


196863 05-Sep-2009 trasz

Improve wording.

Discussed with: pjd, cperciva, rink, wkoszek and des, in order of appearance.


196703 31-Aug-2009 pjd

Backport the 'dirtying dbuf' panic fix from newer ZFS version.

Reported by: Thomas Backman <serenity@exscape.org>
MFC after: 1 week


196702 31-Aug-2009 pjd

Remove empty directory.


196662 30-Aug-2009 pjd

Add missing mountpoint vnode locking.

This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
regular user tries to mount dataset owned by him.

MFC after: 1 week


196458 23-Aug-2009 pjd

- Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
'vdev:worker da0' -> 'vdev da0'


196457 23-Aug-2009 pjd

Set priority of vdev_geom threads and zvol threads to PRIBIO.


196456 23-Aug-2009 pjd

- Give minclsyspri and maxclsyspri real values (consulted with kmacy).
- Honour 'pri' argument for thread_create().


196395 20-Aug-2009 pjd

Our libc doesn't implement control method for XDR (only kernel does) and it
will always return failure. Fix this by bringing userland implementation of
xdrmem_control() back. This allow 'zpool import' to work again.

Reported by: Thomas Backman <serenity@exscape.org>
Reviewed by: kmacy
Approved by: re (kib)


196309 17-Aug-2009 pjd

getcwd() (when __getcwd() fails) works by stating current directory, going up
(..), calling readdir and looking for previous directory inode. In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.

Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.

This fixes /bin/pwd from within .zfs/snapshot/<name>/.

Suggested by: kib
Approved by: re (rwatson)


196307 17-Aug-2009 pjd

Manage asynchronous vnode release just like Solaris.

Discussed with: kmacy
Approved by: re (kib)


196303 17-Aug-2009 pjd

- Reduce z_teardown_lock lock scope a bit.
- The error variable is int, not bool.
- Convert spaces to tabs where needed.

Approved by: re (kib)


196301 17-Aug-2009 pjd

If z_buf is NULL, we should free znode immediately.

Noticed by: avg
Approved by: re (kib)


196299 17-Aug-2009 pjd

- We need to recycle vnode instead of freeing znode.

Submitted by: avg

- Add missing vnode interlock unlock.
- Remove redundant znode locking.

Approved by: re (kib)


196297 17-Aug-2009 pjd

Fix panic in zfs recv code. The last vnode (mountpoint's vnode) can have
0 usecount.

Reported by: Thomas Backman <serenity@exscape.org>
Approved by: re (kib)


196295 17-Aug-2009 pjd

Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by: re (kib)


196291 17-Aug-2009 pjd

- Fix a race where /dev/zfs control device is created before ZFS is fully
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
where one of the zpool/zfs command loads zfs.ko and tries to do some work
immediately, but /dev/zfs is not there yet.

Reported by: pav
Approved by: re (kib)


196289 17-Aug-2009 pjd

Remove files that are no longer used.

Discussed with: kmacy
Approved by: re (kib)


196269 16-Aug-2009 marcel

Fix misalignment in nvpair_native_embedded() caused by the compiler
replacing the bzero(). See also revision 195627, which fixed the
misalignment in nvpair_native_embedded_array().

Approved by: re (kensmith)


196179 13-Aug-2009 trasz

Remove CDDL warning.

Approved by: re (kib), core


195909 27-Jul-2009 pjd

We don't support ephemeral IDs in FreeBSD and without this fix ZFS can
panic when in zfs_fuid_create_cred() when userid is negative. It is
converted to unsigned value which makes IS_EPHEMERAL() macro to
incorrectly report that this is ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.

PR: kern/132337
Submitted by: Matthew West <freebsd@r.zeeb.org>
Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by: re (kib)
MFC after: 2 weeks


195822 22-Jul-2009 trasz

Fix extattr_list_file(2) on ZFS in case the attribute directory
doesn't exist and user doesn't have write access to the file.
Without this fix, it returns bogus value instead of 0. For some
reason this didn't manifest on my kernel compiled with -O0.

PR: kern/136601
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Approved by: re (kib)


195785 20-Jul-2009 trasz

Fix permission handling for extended attributes in ZFS. Without
this change, ZFS uses SunOS Alternate Data Streams semantics - each
EA has its own permissions, which are set at EA creation time
and - unlike SunOS - invisible to the user and impossible to change.
From the user point of view, it's just broken: sometimes access
is granted when it shouldn't be, sometimes it's denied when
it shouldn't be.

This patch makes it behave just like UFS, i.e. depend on current
file permissions. Also, it fixes returned error codes (ENOATTR
instead of ENOENT) and makes listextattr(2) return 0 instead
of EPERM where there is no EA directory (i.e. the file never had
any EA).

Reviewed by: pjd (idea, not actual code)
Approved by: re (kib)


195710 15-Jul-2009 avg

dtrace_gethrtime: improve scaling of TSC ticks to nanoseconds

Currently dtrace_gethrtime uses formula similar to the following for
converting TSC ticks to nanoseconds:
rdtsc() * 10^9 / tsc_freq
The dividend overflows 64-bit type and wraps-around every 2^64/10^9 =
18446744073 ticks which is just a few seconds on modern machines.

Now we instead use precalculated scaling factor of
10^9*2^N/tsc_freq < 2^32 and perform TSC value multiplication separately
for each 32-bit half. This allows to avoid overflow of the dividend
described above.
The idea is taken from OpenSolaris.
This has an added feature of always scaling TSC with invariant value
regardless of TSC frequency changes. Thus the timestamps will not be
accurate if TSC actually changes, but they are always proportional to
TSC ticks and thus monotonic. This should be much better than current
formula which produces wildly different non-monotonic results on when
tsc_freq changes.

Also drop write-only 'cp' variable from amd64 dtrace_gethrtime_init()
to make it identical to the i386 twin.

PR: kern/127441
Tested by: Thomas Backman <serenity@exscape.org>
Reviewed by: jhb
Discussed with: current@, bde, gnn
Silence from: jb
Approved by: re (gnn)
MFC after: 1 week


195702 14-Jul-2009 kib

Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)


195627 11-Jul-2009 marcel

In nvpair_native_embedded_array(), meaningless pointers are zeroed.
The programmer was aware that alignment was not guaranteed in the
packed structure and used bzero() to NULL out the pointers.
However, on ia64, the compiler is quite agressive in finding ILP
and calls to bzero() are often replaced by simple assignments (i.e.
stores). Especially when the width or size in question corresponds
with a store instruction (i.e. st1, st2, st4 or st8).

The problem here is not a compiler bug. The address of the memory
to zero-out was given by '&packed->nvl_priv' and given the type of
the 'packed' pointer the compiler could assume proper alignment for
the replacement of bzero() with an 8-byte wide store to be valid.
The problem is with the programmer. The programmer knew that the
address did not have the alignment guarantees needed for a regular
assignment, but failed to inform the compiler of that fact. In
fact, the programmer told the compiler the opposite: alignment is
guaranteed.

The fix is to avoid using a pointer of type "nvlist_t *" and
instead use a "char *" pointer as the basis for calculating the
address. This tells the compiler that only 1-byte alignment can
be assumed and the compiler will either keep the bzero() call
or instead replace it with a sequence of byte-wise stores. Both
are valid.

Approved by: re (kib)


194850 24-Jun-2009 avg

dtrace/amd64: fix virtual address checks

On amd64 KERNBASE/kernbase does not mean start of kernel memory.
This should fix a KASSERT panic in dtrace_copycheck when copyin*()
is used in D program.
Also make checks for user memory a bit stricter.

Reported by: Thomas Backman <serenity@exscape.org>
Submitted by: wxs (kaddr part)
Tested by: Thomas Backman (prototype), wxs
Reviewed by: alc (concept), jhb, current@
Aprroved by: jb (concept)
MFC after: 2 weeks
PR: kern/134408


194617 22-Jun-2009 kib

O_NOFOLLOW shall be in flags, not in cmode.

Noted by: bde


194586 21-Jun-2009 kib

Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson


194453 18-Jun-2009 jhb

Bootstrap mergeinfo for the OpenSolaris contrib bits.


194300 16-Jun-2009 jhb

Remove confusing mergeinfo caused by renaming files.


194118 13-Jun-2009 jamie

Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by: bz (mentor)


194043 11-Jun-2009 kmacy

pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous
differences with Opensolaris' ZFS

Sorry for the churn


193980 11-Jun-2009 kmacy

check against prefetch_enable


193953 10-Jun-2009 kmacy

use default policy for enabling prefetching unless the TUNABLE is set


193878 10-Jun-2009 kmacy

As far as I can tell systems that have less than 4GB are more often hurt
by prefetched than helped. On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.


193440 04-Jun-2009 ps

Support shared vnode locks for write operations when the offset is
provided on filesystems that support it. This really improves mysql
+ innodb performance on ZFS.

Reviewed by: jhb, kmacy, jeffr


193163 31-May-2009 dfr

Allow the bootfs property to be set for raidz pools on FreeBSD.

Reviewed by: pjd


193128 30-May-2009 kmacy

fix xdrmem_control to be safe in an if statement
fix zfs to depend on krpc
remove xdr from zfs makefile

Submitted by: dchagin@freebsd.org


193110 30-May-2009 kmacy

work around snapshot shutdown race reported by Henri Hennebert


193066 29-May-2009 jamie

Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by: bz (mentor)


193011 29-May-2009 attilio

Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.

This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by: jeff, kmacy
Reviewed by: jeff
Tested by: pho


192983 28-May-2009 des

Nobody spoke up, so assume my interpretation was correct and enable keyword
expansion for this file.


192971 28-May-2009 kmacy

MFdevbranch 192944
- add FreeBSD implementation of xdrmem_control needed by zfs
- have zfs define xdr_ops using FreeBSD's definition
- remove solaris xdr files from zfs compile


192853 26-May-2009 sson

Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by: attilio jhb jb
Approved by: gnn (mentor)


192804 26-May-2009 trasz

Change license to more bori^Wadul^Wcanonical.

Submitted by: rwatson@


192800 26-May-2009 trasz

MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by: pjd


192694 24-May-2009 trasz

Don't allow non-owner to set SUID bit on a file. It doesn't make
any difference now, but in NFSv4 ACLs, there is write_acl permission,
which also affects mode changes.

Reviewed by: pjd


192689 24-May-2009 trasz

Fix comment.


192640 23-May-2009 des

Unexpand $FreeBSD$.


192639 23-May-2009 des

Remove svn:keywords on a file that had fbsd:nokeywords (though I don't
understand the reason for the latter)


192599 22-May-2009 des

Expand $FreeBSD$


192360 19-May-2009 kmacy

- back out direct map hack
- it is no longer needed


192240 17-May-2009 kmacy

set createtxg prop name

PR: bin/130105


192237 17-May-2009 kmacy

SAVESTART implies SAVENAME


192234 17-May-2009 kmacy

enable adaptive spinning on zfs locks


192211 16-May-2009 kmacy

- allow forced unmounts
- don't assume snapshot was auto-mounted


192209 16-May-2009 kmacy

only use direct map if system has more than 2GB


192207 16-May-2009 kmacy

apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map


192194 16-May-2009 dfr

Add support for booting from raidz1 and raidz2 pools.


191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


191984 11-May-2009 kmacy

rename xdr support files to avoid conflicts when linking in to the kernel


191931 09-May-2009 kmacy

- rename atomic.S and crc32.c to avoid collisions when linking zfs in to the kernel
- update Makefile
- ifdef out acl_{alloc, free}, they aren't used by zfs and conflict with existing in-kernel routines


191915 08-May-2009 zec

Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by: bz
Approved by: julian (mentor)


191907 07-May-2009 kmacy

don't call vn_rele_async_fini in the !_KERNEL case


191905 07-May-2009 kmacy

move VN_RELE_ASYNC to the compatibility layer with the rest of the VN_* defines


191903 07-May-2009 kmacy

avoid LOR and gratuitous extra lock acquisitions by moving user_evict list buffers to
a temporary list


191902 07-May-2009 kmacy

Allow the VM to provide backpressure on the ARC cache as it does
on Solaris.


191900 07-May-2009 kmacy

Asynchronously release vnodes to avoid blocking on range locks when calling back in to zfs.
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.

This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.

The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.

Reviewed by: pjd
MFC after: 7 days


191673 29-Apr-2009 jamie

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


191517 26-Apr-2009 ed

Remove empty directories from the HEAD.

Discussed with: developers, imp


190888 10-Apr-2009 rwatson

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


190878 10-Apr-2009 thompsa

Revert r190676,190677

The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by: scottl


190676 03-Apr-2009 thompsa

Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.


190581 30-Mar-2009 mav

Integrate user/mav/ata branch:

Add ch_suspend/ch_resume methods for PCI controllers and implement them
for AHCI. Refactor AHCI channel initialization according to it.

Fix Port Multipliers operation. It is far from perfect yet, but works now.
Tested with JMicron JMB363 AHCI + SiI 3726 PMP pair.
Previous version was also tested with SiI 4726 PMP.

Hardware sponsored by: Vitsch Electronics / VEHosting.nl


190419 25-Mar-2009 rwatson

Move dtnfsclient.c in the cddl tree to nfs_kdtrace.c in the nfsclient
directory, since it's under a BSD license, and this keeps NFS internals-
aware tracing parts close to NFS.

MFC after: 1 month
Suggested by: jhb


190380 24-Mar-2009 rwatson

Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

nfsclient:accesscache:flush:done
nfsclient:accesscache:get:hit
nfsclient:accesscache:get:miss
nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

nfsclient:attrcache:flush:done
nfsclient:attrcache:get:hit
nfsclient:attrcache:get:miss
nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after: 1 month
Sponsored by: Google, Inc.


190293 22-Mar-2009 rwatson

Add dtnfsclient, a first cut at an NFSv2/v3 client reuest DTrace
provider. The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.

Probes are named in the following way:

...
nfsclient:nfs2:write:start
nfsclient:nfs2:write:done
...
nfsclient:nfs3:access:start
nfsclient:nfs3:access:done
...

Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice. Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.

Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc. More detailed RPC information might
best be provided by adding a krpc provider. It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.

Sponsored by: Google, Inc.
MFC after: 1 month


189967 18-Mar-2009 jhb

The zfs_get_xattrdir() function is used to find the extended attribute
directory for a znode. When the directory already exists, it returns a
referenced but unlocked vnode. When a directory does not yet exist, it
calls zfs_make_xattrdir() to create a new one. zfs_make_xattrdir() returns
the vnode both referenced and and locked and zfs_get_xattrdir() was leaking
this vnode lock to its callers. Fix this by dropping the vnode lock if
zfs_make_xattrdir() successfully creates a new extended attribute
directory.

Reviewed by: pjd


189696 11-Mar-2009 jhb

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


189290 02-Mar-2009 jamie

Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by: bz (mentor)


189170 28-Feb-2009 ed

Add memmove() to the kernel, making the kernel compile with Clang.

When copying big structures, LLVM generates calls to memmove(), because
it may not be able to figure out whether structures overlap. This caused
linker errors to occur. memmove() is now implemented using bcopy().
Ideally it would be the other way around, but that can be solved in the
future. On ARM we don't do add anything, because it already has
memmove().

Discussed on: arch@
Reviewed by: rdivacky


188588 13-Feb-2009 jhb

Use shared vnode locks when invoking VOP_READDIR().

MFC after: 1 month


187830 28-Jan-2009 ed

Last step of splitting up minor and unit numbers: remove minor().

Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.


185614 04-Dec-2008 imp

Put the MIPS support back in after it was removed in r185029.


185430 29-Nov-2008 pjd

MFp4: Remove assertion that is no longer valid - we now use VOP_CLOSE() in
more places (ie vdev_file.c).


185321 25-Nov-2008 trasz

MFp4: We don't support TX_CREATE_ACL_ATTR nor TX_MKDIR_ACL_ATTR; code found
in zfs_replay.c will panic if it encounters transactions of this type.
Make sure we don't put these into the ZIL.

Approved by: rwatson (mentor), pjd


185319 25-Nov-2008 pjd

Fix locking (file descriptor table and Giant around VFS).

Most submitted by: kib
Reviewed by: kib


185310 25-Nov-2008 ganbold

Remove unused variable.

Found with: Coverity Prevent(tm)
CID: 3669,3671

Approved by: jb


185228 23-Nov-2008 pjd

Don't use PRIV_ROOT. Here we check if user can share ZFS file system, so
PRIV_NFS_DAEMON seems best choice.

Discussed with: rwatson


185174 22-Nov-2008 pjd

IFp4: Don't rely on disk IDs and always use vdev guids, which means always look
up for components by reading metadata. This might be slower when there are big
number of disks in the system, but is definiately more reliable.


185172 22-Nov-2008 pjd

IFp4: Finish implemnetation of chflags(2) for ZFS. While doing this I found
that zfs_access() can only handle VREAD, VWRITE and VEXEC, for the rest we need
to use vaccess(9).


185171 22-Nov-2008 pjd

IFp4: Don't free pathname too soon, debugging code is still using it.


185154 21-Nov-2008 dfr

Add definitions for ZFS pool version 13.


185097 19-Nov-2008 dfr

Some zfsboot fixes from Norikatsu Shigemura:

1. zfsboot2 (boot2) doesn't %d (printf), so change %d to %u.
2. chase new zpool versioning as SPA_VERSION.
Obtained from: sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h

Submitted by: nork


185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


184770 08-Nov-2008 trasz

Require write access on a directory being moved from one parent
directory to another in ZFS.

Approved by: rwatson (mentor), pjd


184740 06-Nov-2008 trasz

Backoff the last patch. It was overly restrictive - we want to check
for write permission on target only when moving the target between two
directories.

Approved by: rwatson (mentor)


184737 06-Nov-2008 trasz

Change ZFS behaviour to match UFS: when moving (rename(2)) a subdirectory
from one parent directory to another, in addition to the usual access checks
one also needs write access to the subdirectory being moved.

Approved by: rwatson (mentor), pjd


184701 05-Nov-2008 rodrigc

Remove definition of KMEM_DEBUG accidentally brought in by latest DTrace
import.

Noticed by: thompsa


184698 05-Nov-2008 rodrigc

Merge latest DTrace changes from Perforce.


184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


183417 27-Sep-2008 jb

Disable use of the user credentials until there is code to set the levels
that DTrace uses.

This fixes a bug that would have affected kernels built with MAC and all
kernels built after the mpsafetty integration.

The bug will be apparent in RELENG7 on MAC kernels.

Reported by: kan


183397 27-Sep-2008 ed

Replace all calls to minor() with dev2unit().

After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.

This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.

Reviewed by: kib


183381 26-Sep-2008 ed

Remove unit2minor() use from kernel code.

When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.

We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().

Reviewed by: kib


183154 18-Sep-2008 imp

Mips needs the same treatment for atomic_or_8 as the other RISCy
architectures.


183037 15-Sep-2008 pjd

Add missing ZFS_EXIT().

PR: kern/124899
Submitted by: Masakazu Asama <m-asama@ginzado.ne.jp>


182905 10-Sep-2008 trasz

Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by: rwatson (mentor)


182840 07-Sep-2008 pjd

Initialize vp, so we don't call VOP_UNLOCK() with NULL vnode pointer.

Confirmed by: marcus


182824 06-Sep-2008 pjd

Lock vnode exclusively around insmntque().


182781 05-Sep-2008 pjd

Catch up after last insmntque() changes:
- The vnode has to be locked exclusively before calling insmntque().
- Until I find a way to handle insmntque() failures use VV_FORCEINSMQ flag
to force insmntque() to always succeed.

Reported by: kris, trasz, des, others
Suggested by: kib
Tested by: trasz


182542 31-Aug-2008 attilio

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


182448 29-Aug-2008 scottl

Ensure that the padding calcualtion doesn't return a negative value.

Submitted by: kib
Approved by: jb


182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


182031 23-Aug-2008 imp

Add MIPS support.

Reviewed by: jb@


181879 19-Aug-2008 jb

Add calls to callout_drain() to ensure the callouts are flushed before
we free memory from underneath them.

This fixes an occasional panic I've been seeing in softclock() where a bad
pointer would be encountered when pushing DTrace hard.


180660 21-Jul-2008 pjd

We want to use LBOLT instead of lbolt on FreeBSD.
I've this already fixed in p4, but the fix was never integrated into HEAD.

Reported by: ed


180652 21-Jul-2008 pjd

We want to check new options given, not the current ones.
This fixes 'zpool import -o <mntopt> <name>' not working properly.


179758 12-Jun-2008 ed

Remove the $FreeBSD$ tag again, now I know fbsd:nokeywords exists.

Requested by: pjd
Approved by: philip (mentor)


179757 12-Jun-2008 ed

Turn dev2unit(), minor(), unit2minor() and minor2unit() into macro's.

Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macro's. The unit2minor() and
minor2unit() macro's are now no-ops.

The ZFS code als defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.

Approved by: philip (mentor), pjd


179726 11-Jun-2008 ed

Don't enforce unique device minor number policy anymore.

Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.

Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.

This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.

The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.

Approved by: philip (mentor)


179469 01-Jun-2008 jb

Merge a recent change from the OpenSolaris source tree.
(Don't ask for a vendor import of this yet, we're in the early days of svn)

Instead of using cyclic timers to call the state clean and deadman callbacks,
use a callout on FreeBSD to avoid the deadlock on FreeBSD due to trying to
send interprocessor interrupts with interrupts disabled.

Reported by: ps, jhb, peter, thompsa


179310 25-May-2008 pjd

Fix namespace collision after src/sys/sys/file.h:1.78.


179307 25-May-2008 jb

Comment out the code that breaks with invariants. This is stuff that is
still WIP along with the lockstat provider, so there is no harm leaving
it out for now.


179293 24-May-2008 bz

Remove redundant redeclaration of 'zone_drain'.


179280 24-May-2008 jb

Make the zfs module depend on the opensolaris module in preparation for it
to shared stuff with the DTrace modules.


179266 23-May-2008 jb

Messing with the endian defines breaks the use of other FreeBSD headers.


179264 23-May-2008 jb

Delete a couple of OpenSolaris headers which get in the way of our
implementation.


179263 23-May-2008 jb

OpenSolaris kernel module compatibility sources.


179260 23-May-2008 jb

The cyclic timer device. This is a cut down version of the one in
OpenSolaris. We don't have the lock levels that they do, so this is just
hooked into clock interrupts.


179237 23-May-2008 jb

Custom DTrace kernel module files plus FreeBSD-specific DTrace providers.


179204 22-May-2008 jb

A 'special' compatibility header to plug OpenSolaris code.


179203 22-May-2008 jb

Additional compatibility headers.


179202 22-May-2008 jb

Compatibility stuff for DTrace.


179198 22-May-2008 jb

FreeBSD changes to vendor source.


179194 22-May-2008 jb

This commit was generated by cvs2svn to compensate for changes in r179193,
which included commits to RCS files with non-trunk default branches.


179035 16-May-2008 attilio

LO_ENROLLPEND is no more existing so just axe it (it was left out by the
original commit axing it).


178414 22-Apr-2008 jb

Add FreeBSD IDs to files that originate in FreeBSD.


178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


178129 11-Apr-2008 marius

Add atomic operations for ZFS/sparc64.

Approved by: core, pjd
Obtained from: OpenSolaris (w/ adaptations)
MFC after: 2 weeks


178127 11-Apr-2008 marius

- Fix the path encoded in the multiple inclusion protection.
- GCC uses 32-byte function alignment for UltraSPARC CPUs.
- Remove code duplication.

Approved by: core, pjd
MFC after: 2 weeks


177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


177230 15-Mar-2008 pjd

Fix mmap(2) on ZFS after some changes in VM subsystem.

Submitted by: alc
Reported by: kris (originally) and many others
Tested with: fsx
MFC after: 1 week


176559 25-Feb-2008 attilio

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


175633 24-Jan-2008 pjd

- Reduce how much ZFS caches by default. This is another change to mitigate
'kmem_map too small panics'.
- Print two warnings if there is not enough memory and not enough address
space.
- Improve comment.


175632 24-Jan-2008 pjd

Change type of kmem_used() and kmem_size() functions to uint64_t, so it
doesn't overflow in arc.c in this check:

if (kmem_used() > (kmem_size() * 4) / 5)
return (1);

With this bug ZFS almost doesn't cache.

Only 32bit machines are affected that have vm.kmem_size set to values >=1GB.

Reported by: David Taylor <davidt@yadt.co.uk>


175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


174065 29-Nov-2007 jb

Remove some compatibility stuff that we now get from the Solaris header.


174049 28-Nov-2007 jb

* Check endianness the FreeBSD way.

* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
variable.


174048 28-Nov-2007 jb

Fix a prototype definition.


174047 28-Nov-2007 jb

Check endianness the FreeBSD way.


174046 28-Nov-2007 jb

Include an extra header to get this to compile cleanly.


174042 28-Nov-2007 jb

Add more OpenSolaris compatibility headers.


174041 28-Nov-2007 jb

Remove an extern that is defined elsewhere.


174040 28-Nov-2007 jb

Add compatibility cruft moved from under _SOLARIS_C_SOURCE in sys/types.h


174039 28-Nov-2007 jb

Remove a typedef which was just a hack to avoid including vmem.h.
That typedef breaks other Solaris code.


174037 28-Nov-2007 jb

Add a missing volatile so that the code compiles cleanly.


174036 28-Nov-2007 jb

Rename the definition of lbolt to LBOLT to avoid a clash with a global
variable in FreeBSD. Until now lbolt in sys/proc.h has been #ifdef'ed
out based on _SOLARIS_C_SOURCE, but that is going away now.


173419 07-Nov-2007 pjd

Warn if kmem_map size is set to less than 512MB. Previous warning was a bit
pointless, because default is set to something around 300MB and also
insufficient.

MFC after: 3 days


173374 05-Nov-2007 pjd

Remove unused header.

MFC after: 3 days


173373 05-Nov-2007 pjd

If setting a state to anything but open state, close access to vdev.
This fixes replacing drive in place, eg. zpool replace tank da1 da1.
Before it complained that device is already open.

MFC after: 1 week


173371 05-Nov-2007 pjd

Remove "zfs:" prefix from lock and condvar names and also skip non-letter
characters (mostly "&"). Because top(1) shows only first six characters of
wait channel, without this change we saw only one meaningful character.

Requested by: kris & others
MFC after: 1 week


173268 02-Nov-2007 lulf

- Add sysctl for sizeof(znode_t), which will be used by fstat(1).

Approved by: pjd (mentor)


173250 01-Nov-2007 pjd

Call zil_commit() (if ZIL is not disabled) after every non-read request
(BIO_WRITE and BIO_FLUSH) as it is done is Solaris. The difference is
that Solaris calls it only for sync requests, but we can't say in GEOM
is the request is sync or async, so we do it for every request.

MFC after: 1 week


173247 01-Nov-2007 pjd

- Move crfree() outside MNT_ILOCK()/MNT_IUNLOCK() to eliminate a LOR:
1st 0xc4cea568 struct mount mtx (struct mount mtx) @ /usr/src/sys/modules/zfs/../../compat/opensolaris/kern/opensolaris_vfs.c:209
2nd 0xc3ee9010 sleep mtxpool (sleep mtxpool) @ /usr/src/sys/kern/kern_resource.c:1266
- Move crdup() outside MNT_ILOCK()/MNT_IUNLOCK(), as it can sleep.

Reported by: Olli Hauer <ohauer@gmx.de>
MFC after: 3 days


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172645 14-Oct-2007 thompsa

ZFS_LOG adds a newline by itself.

Pointed out by: pjd


172624 14-Oct-2007 thompsa

Print the ZFS ereport to the console if vfs.zfs.debug is set to help diagnose
problems with zfs-on-root since devd isnt running yet.

Reviewed by: pjd


172443 04-Oct-2007 pjd

Fix lock leak leading to the 'System call <name> returning with 1 locks held'
panic.

Reported by: kris
Approved by: re (kensmith)


172301 23-Sep-2007 pjd

Now that we have CDDLed code in the tree, add CDDL license.

Discussed with: core
Approved by: re (kensmith)


172135 10-Sep-2007 pjd

Reduce the limit of vnodes on i386 when ZFS is loaded to 3/4 of the original
value, so we don't run out of KVA. The default vnodes limit fits better for
UFS, but ZFS allocated more file system specific memory for a vnode than UFS.

Don't touch vnodes limit if we detect it was tuned by system administrator
and restore original value when ZFS is unloaded.

This isn't final fix, but before we implement something better, this will
help to stabilize ZFS under heavy load on i386.

Approved by: re (bmah)


172130 10-Sep-2007 pjd

After dfr@ vnode leak fix, we can allow ARC to consume more memory.

Tested by: kris
Approved by: re (bmah)


172030 01-Sep-2007 pjd

Use CTLFLAG_RDTUN for tunable sysctls.

Approved by: re (bmah)


171863 16-Aug-2007 pjd

Some ZFS threads needs stack larger than the default 8kB, so use 16kB of
alternate stack if the default is smaller than 16kB.

Approved by: re (rwatson)


171567 24-Jul-2007 pjd

Update assertion after revision 1.23.

Reviewed by: dfr
Approved by: re (rwatson)


171316 09-Jul-2007 dfr

Correct a reference-counting mistake in the ZFS code which led to abnormal
memory usage and pessimal cache performance.

Reviewed by: pjd
Approved by: re (rwatson)


171063 27-Jun-2007 dfr

In zfs_vget, if we fail to translate an inode number to the corresponding
vnode, make sure we return an error code to the caller.

Reviewed by: pjd
Approved by: re


170587 12-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


170437 08-Jun-2007 marcel

Add my copyright.

Requested by: pjd@


170431 08-Jun-2007 pjd

- Reduce number of atomic operations needed to be implemented in asm by
implementing some of them using existing ones.
- Allow to compile ZFS on all archs and use atomic operations surrounded
by global mutex on archs we don't have or can't have all atomic
operations needed by ZFS.


170430 08-Jun-2007 pjd

Missing atomic operations for ZFS/ia64.

Submitted by: marcel


170289 04-Jun-2007 dwmalone

Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.


170281 04-Jun-2007 pjd

Reimplement traverse() helper function:
1. Pass locking flags to VFS_ROOT().
2. Check v_mountedhere while the vnode is locked.
3. Always return locked vnode on success.

Change 1 fixes problem reported by Stephen M. Rumble - after
zfs_vfsops.c,1.9 change, zfs_root() no longer locks the vnode
unconditionally and traverse() didn't pass right lock type to
VFS_ROOT(). The result was that kernel paniced when .zfs/ directory
was accessed via NFS.


170152 31-May-2007 kib

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


170044 28-May-2007 pjd

Adjust va_mask for setattr. FreeBSD doesn't have va_mask, so we initialize it
based on individual fields beeing set. This doesn't work for setattr replay,
because va_type is set there, so we add AT_TYPE flag to va_mask, which won't
be accepted by zfs_setattr().

Reported by: kris


170040 28-May-2007 pjd

Because we allocate componentname structures on stack, bzero() them before
use just in case.


170006 26-May-2007 pjd

There are too many false positive LORs reported by WITNESS, so when ZFS
debug is turned off, initialize locks with NOWITNESS flag.
At some point I'll get back to them, we would probably need BLESSING
functionality, which is currently turned off by default.


169934 24-May-2007 pjd

DNLC_NO_VNODE can't be NULL.

Reported by: ru


169929 24-May-2007 pjd

Initialize ZFS a bit earlier and block root mounting until
initialization is complete. This fixes some root-on-ZFS
configurations.

Reported by: Bruno Damour <freebsd.ruomad@free.fr>
Tested by: Bruno Damour <freebsd.ruomad@free.fr>


169920 23-May-2007 pjd

FreeBSD's namecache works quite well with ZFS, so remove DNLC.


169919 23-May-2007 pjd

All objects we create using GFS are directories, so initialize d_type
properly, but add XXX comment saying that it can eventually change in
the future.


169884 22-May-2007 pjd

Lock vnode on lookup. This fixes ZIL replay for rmdir/unlink/rename.

Reported by: des


169430 09-May-2007 pjd

Increase debug level - this message is not that important.


169325 06-May-2007 pjd

- Add missing lock destruction and remove duplicate initializations.
With this change it is possible to unload zfs.ko module from
WITNESS-enabled kernel.
- Remove bogus comment.


169303 06-May-2007 pjd

Use provider's ident to handle situations when disks are moved around
and show up with different names: first try to open provider using
remembered name and compare its ident, if equal, this is our provider,
if not equal or there is no provider with such name, find provider with
remembered ident and don't care about the name.


169302 06-May-2007 pjd

MFp4: We don't need to cover vnode_pager_setsize() with the z_map_lock.


169199 02-May-2007 pjd

Share-lock a vnode where possible.


169198 02-May-2007 pjd

When parent directory has to be unlocked, lock it back with the same lock
type. Before this change, if directory was shared-locked, it was relocked
exclusively.


169197 02-May-2007 pjd

Lock vnode using cn_lkflags in case the caller wants the vnode to be
shared-locked.


169196 02-May-2007 pjd

The getnewvnode() function sets LK_NOSHARE by default, so if we want to
support shared vnodes locking, we need to remove that flag.
Also add LK_CANRECURSE flag as found in nfsclient.


169195 02-May-2007 pjd

ZFS should update timestamps upon the creat() of an existing file.

Obtained from: OpenSolaris
Bug: http://bugs.opensolaris.org/view_bug.do?bug_id=6465105


169194 02-May-2007 pjd

- Lock vnode with flags passed in as argument in zfs_vget() and zfs_root().

Pointed out by: ups
Also reported by: kris

- Add comments where I'm not sure if LK_RETRY should be used.


169172 01-May-2007 pjd

MFp4: Remove LK_RETRY flag when locking vnode in zfs_lookup, we don't want
dead vnodes here.

Suggested by: kib


169170 01-May-2007 pjd

White space fixes.


169167 01-May-2007 pjd

Add a comment explaining why we call dmu_write() unconditionally, even if
uiomove() fails, especially that it is different from what OpenSolaris
does (I'm not entirely sure they are right).

Suggested by: darrenr


169108 29-Apr-2007 pjd

- Define d_type for ".", ".." and ".zfs" directories.
- Add a TODO comment where d_type is still noe defined.


169107 29-Apr-2007 pjd

Oops, correct important typo in last commit.


169106 29-Apr-2007 pjd

Avoid freeing NULL pointer in case of an error.


169087 29-Apr-2007 pjd

Fix two use-after-free cases.


169059 26-Apr-2007 pjd

MFp4: Optimize mappedwrite() and mappedread() functions to write/read as much
non-mapped data as possible at once and not page-by-page. Which this change we
combain I/Os, but also saves many VM_OBJECT_UNLOCK()/VM_OBJECT_LOCK()
operations.

Simple 'fsx -l 33554432 -o 524288 -N 10000 /tank/fsx' test shows ~23%
performance increase.


169057 26-Apr-2007 pjd

- Always try to write one whole page at a time.
- vm_page_undirty() is enough (instead of vm_page_set_validclean()), but it has
to be called before we write the data in case someone makes page dirty after
our write, but before our vm_page_undirty() call.
- Always dmu_write, not matter if uiomove() succeeded, because it could
partially be ok and we would lose some changes.

All good ideas from: ups


169056 26-Apr-2007 pjd

MFV: Free znodes immediatelly, allowing the ARC to hold onto less memory.

Full description at: http://bugs.opensolaris.org/view_bug.do?bug_id=6543706


169055 26-Apr-2007 pjd

MFV: Functions name change.


169028 24-Apr-2007 pjd

ZIL (ZFS Intent Log) can be safely turned on and off at run time, because
it is only used when dataset is beeing mounted to decide if log should also
be opened.


169027 24-Apr-2007 pjd

MFp4: Now that ZFS can use FreeBSD's namecache, turn it off by default and
turn off DNLC, but don't remove DNLC yet just in case.


169025 24-Apr-2007 pjd

MFp4: Rearange the code so vobject is destroyed from reclaim() method like
in all other file system on FreeBSD (instead from inactive() method).

A nice side-effect of this change, except that it speedups file system
when mmaped file are often open/closed, is that it makes FreeBSD's
namecache work:)


169024 24-Apr-2007 pjd

MFp4: Once page is written successfully, we should clear the dirty bits.
This fixes slow operations on mmaped files, because without this fix,
pages were written to disk multiple times.

If one is looking for even greater speed up for such operation, he should
disable ZIL (by setting vfs.zfs.zil_disable to 1 in /boot/loader.conf).
Disabling ZIL makes fsx run ~9 times faster.


169023 24-Apr-2007 pjd

MFp4: Reduce diff against vendor.


169022 24-Apr-2007 pjd

MFp4: We have stronger 'lock already initialized' check now, so we can
reduce diff against the vendor by removing bzero of this mutex.


168987 23-Apr-2007 bmah

Mostly-cosmetic fixes in low-memory warning messages:

o Fix linewrap issues.

o Fix two typos (s/Recomended/Recommended/ and s/tunning/tuning/)

o Remove a couple of extra instances of the word "of".

o Update names of kmem_size variables.

Approved by: pjd


168978 23-Apr-2007 pjd

Too much diff reduction. 'cmd' has to be u_long.

Reported by: delphij


168962 23-Apr-2007 pjd

MFp4: Reduce diff against vendor code:
- Move FreeBSD-specific code to zfs_freebsd_*() functions in zfs_vnops.c
and keep original functions as similar to vendor's code as possible.
- Add various includes back, now that we have them.


168959 22-Apr-2007 pjd

Fix 'zpool status -v'. To get object number we should use ZFS_DIRENT_OBJ()
macro, as za_first_integer field also contains type. This should be fixed in
ZFS itself, but this bug is not visible on Solaris, because there, type is
not stored in za_first_integer. On the other hand it will be visible on
MacOS X.

Reported by: Barry Pederson <bp@barryp.org>


168958 22-Apr-2007 pjd

Fix st_rdev handling (implement it, actually).

Reported by: gj


168926 21-Apr-2007 pjd

MFp4:

@118370 Correct typo.

@118371 Integrate changes from vendor.

@118491 Show backtrace on unexpected code paths.

@118494 Integrate changes from vendor.

@118504 Fix sendfile(2). I had two ways of fixing it:
1. Fixing sendfile(2) itself to use VOP_GETPAGES() instead of
hacking around with vn_rdwr(UIO_NOCOPY), which was suggested
by ups.
2. Modify ZFS behaviour to handle this special case.

Although 1 is more correct, I've choosen 2, because hack from 1
have a side-effect of beeing faster - it reads ahead MAXBSIZE
bytes instead of reading page by page. This is not easy to implement
with VOP_GETPAGES(), at least not for me in this very moment.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>

@118525 Reorganize the code to reduce diff.

@118526 This code path is expected. It is simply when file is opened with
O_FSYNC flag.

Reported by: kris
Reported by: Michal Suszko <dry@dry.pl>


168840 18-Apr-2007 pjd

MFp4: Fix automatic snapshot mount when unprivileged user does lookup
on a snapshot directory:
- Remove PRIV_VFS_MOUNT check - regular users can mount snapshots
via lookups on snapshot directory.
- Reset mount credential to kcred, so user won't be able to unmount
the snapshot.
- Reset owner uid.
- Unlock vnode in case of a failure.

Reported by: simokawa


168839 18-Apr-2007 pjd

MFp4: We check for PRIV_VFS_MOUNT already in mount(2) syscall and we don't
want to do the check when snapshot is automatically mounted by an
unprivileged user doing lookup on a snapshot directory.


168826 17-Apr-2007 pjd

Simplify.


168824 17-Apr-2007 pjd

- Fix a leftover - vfs_mount_alloc() is now exported properly.
This fixes stange panics when listing .zfs/snapshot/ directory for me.
Reported by: simokawa
Reported by: Johan Hendriks <Johan@double-l.nl>
- Hide cache_purge() under FREEBSD_NAMECACHE like in other files.
- Protect mnt_flag with mount interlock.


168821 17-Apr-2007 pjd

Ignore hostid check for root-on-ZFS configurations. Making hostid available
before the root is mounted is tricky and having it in /boot/ is not really
desire.

Reported by: Zephiris <zephiris@gmail.com>


168775 16-Apr-2007 pjd

Uncomment forgotten check. Without this check in-place, ZFS will panic on
unload instead of returning EBUSY. This check tells if there are mounted
ZFS file systems or not. We can't unload if there are mounted file systems.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>


168753 15-Apr-2007 pjd

MFp4: Start DNLC after desiredvnodes variable is initialized.
Before this change if zfs.ko was loaded by the loader, DNLC was
automatically disabled.

Reported by: Zephiris <zephiris@gmail.com>


168738 14-Apr-2007 pjd

Fix RAID-Z resilvering.

Obtained from: OpenSolaris


168724 14-Apr-2007 pjd

MFp4: Hmm, it seems to work now.


168715 14-Apr-2007 pjd

MFp4: Use max_ncpus, which is used in other places in the code.


168714 14-Apr-2007 pjd

MFp4: Add more debug, so we can see if zpool.cache was loaded or why it
wasn't loaded.


168713 14-Apr-2007 pjd

MFp4: Allow to tune vfs.zfs.debug from loader.conf.


168712 14-Apr-2007 pjd

MFp4: - Allow to tune number of spa_zio_* threads.
- Reduce default number of spa_zio_* threads to N*spa_zio_issue
plus N*spa_zio_intr threads per ZIO type, where N is the number
of CPUs.
- Put ZIO type number in thread's name.


168696 13-Apr-2007 pjd

Fix overflow, which was causing endless loops when 32bit machine had more
than 2GB of RAM. This was because our physmem is long and 'physmem*PAGESIZE'
can be negative for more than 2GB of memory.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>

It is not yet tested by Andrey, so there can be other problems, but this
was definiately a bug, so I'm committing a fix now.


168683 13-Apr-2007 pjd

Fix vnodes starvation caused by DNLC (ZFS name cache):
- Tune number of namecache entires better (based on desiredvnodes).
- Handle vfs_lowvnodes event by releasing requested number of name cache
entries, but no less than 5%.

Reported by: simokawa


168676 12-Apr-2007 pjd

MFp4: Synchronize with vendor (mostly 'zfs rename -r').


168675 12-Apr-2007 pjd

MFp4: Bring back comments.

Requested by: jhb


168604 10-Apr-2007 wkoszek

strchr() and strrchr() are already present in the kernel, but with less
popular names. Hence:

- comment current index() and rindex() functions, as these serve the same
functionality as, respectively, strchr() and strrchr() from userland;
- add inlined version of strchr() and strrchr(), as we tend to use them more
often;
- remove str[r]chr() definitions from ZFS code;

Reviewed by: pjd
Approved by: cognet (mentor)


168583 10-Apr-2007 pjd

MFp4: Allow to set zfs_recover via vfs.zfs.recover from /boot/loader.conf.


168582 10-Apr-2007 pjd

MFp4: Hide under '#ifdef _KERNEL' only what's really needed.


168566 10-Apr-2007 pjd

Try to stabilize ZFS with regard to memory consumption:
- Allow to shrink ARC down to 16MB (instead of 64MB).
- Set arc_max to 1/2 of kmem_map by default.
- Start freeing things earlier when low memory situation is detected.
- Serialize execution of arc_lowmem().

I decided to setup minimum ZFS memory requirements to 512MB of RAM and 256MB of
kmem_map size. If there is less RAM or kmem_map, a warning will be printed.
World is cruel, be no better. In other words: modern file system requires
modern hardware:)

From ZFS administration guide:

"Currently the minimum amount of memory recommended to install a Solaris
system is 512 Mbytes. However, for good ZFS performance, at least one
Gbyte or more of memory is recommended."


168565 10-Apr-2007 pjd

Reduce diff against vendor - we have now stronger check for "mutex already
initialized", so we can go back to kmem_alloc().


168559 09-Apr-2007 pjd

Remove unused #define.


168514 09-Apr-2007 pjd

Instead of detecting if lock is already initialized based on standard 1 bit
check, use more accurate 13 bits check. We had too many false-positives with
the standard check.

Reported by: mlaier


168511 09-Apr-2007 pjd

We don't have to wait for the root file system to be mounted anymore, now that
kobj KPI supports operating on files loaded by the loader.


168510 09-Apr-2007 pjd

Drop the Giant lock before calling zfs_domount(), which is held when
mounting root file system.


168509 08-Apr-2007 pjd

Move zpool.cache from /etc/zfs/ to /boot/zfs/, so we can keep it on
dedicated /boot/ file system and use ZFS for the root file system.


168508 08-Apr-2007 pjd

Extend kobj compatibility KPI to support operating on files before and
after the root file system is mounted.
This is one of the changes that will allow to put root file system on ZFS.


168498 08-Apr-2007 pjd

MFp4: Synchronize with recent OpenSolaris changes.


168494 08-Apr-2007 pjd

- Use 'name=value' so it can be properly recognized by devd(8).
- Use only subclass as devd's type.


168488 08-Apr-2007 pjd

Take vnode pointer and hold it under znode lock, so we won't race with
zfs_reclaim(). This may or may not fix problem reported by kris, but it's
definiatelly better that way.


168482 07-Apr-2007 pjd

Move atomic.S files to directories that better fit OpenSolaris directory
layout.


168481 07-Apr-2007 pjd

Fix libzpool compilation.

Reported by: des


168478 07-Apr-2007 pjd

Limit the number of system taskq threads to the number of CPUs.
They are only used when there is a need for reducing namecache.

Observed by: kris, csjp


168474 07-Apr-2007 des

Fix some type mismatches.

Reviewed by: pjd@


168473 07-Apr-2007 pjd

Allow to tune maximum and minimum memory used by ARC.


168460 07-Apr-2007 pjd

Add missing mutex_init() which was causing assertion panic when on clone
destruction.

Reported by: kris


168404 06-Apr-2007 pjd

Please welcome ZFS - The last word in file systems.

ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.

I'd like to thank all SUN developers that created this great piece of software.

Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)