History log of /netbsd-current/sys/kern/vfs_vnops.c
Revision Date Author Comments
# 1.242 10-Jul-2023 christos

Add memfd_create(2) from GSoC 2023 by Theodore Preduta


# 1.241 22-Apr-2023 riastradh

file(9): New fo_posix_fadvise operation.

XXX kernel revbump -- changes struct fileops API and ABI


# 1.240 22-Apr-2023 riastradh

file(9): New fo_fpathconf operation.

XXX kernel revbump -- struct fileops API and ABI change


# 1.239 22-Apr-2023 riastradh

file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnops.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change


# 1.238 22-Apr-2023 riastradh

readdir(2), lseek(2): Fix races in access to struct file::f_offset.

For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock

For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.

This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).

There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)
Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but not necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.
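
The torn f_offset that this change rules out can be replayed deterministically in user space. The following is a minimal sketch, not kernel code: `combine_halves` and `torn_read_hi_first` are hypothetical names, and the model assumes a 64-bit store that is carried out as two separate 32-bit stores, high half first.

```c
#include <stdint.h>

/*
 * Model of a 64-bit value kept as two 32-bit halves, as on a 32-bit
 * machine that cannot store 64 bits in one instruction.
 */
static uint64_t
combine_halves(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Value seen by an unsynchronized reader that runs between the writer's
 * two half-stores: new high half, but still the old low half.
 */
static uint64_t
torn_read_hi_first(uint64_t oldval, uint64_t newval)
{
	return combine_halves((uint32_t)(newval >> 32), (uint32_t)oldval);
}
```

With an old offset of 0x00000000ffffffff and a new offset of 0x0000000100000000, the reader in this model sees 0x00000001ffffffff, which is neither value. Serializing the load and store under a lock, as the locking rules above require, excludes that intermediate state.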


# 1.237 13-Mar-2023 riastradh

vn_open(9): Add assertion that vp is locked on return.

Null out vp internally out of paranoia so we'll crash in evaluating
the assertion if we ever reach it via one of the vput paths.


# 1.236 13-Mar-2023 riastradh

vn_open(9): Clarify that this returns a locked vnode.

Comment only, no functional change intended.


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.235 06-Aug-2022 riastradh

vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.
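
One common way to keep a range check from overflowing is to compare the size against the remaining headroom instead of forming the sum. The sketch below illustrates that idea only; `range_fits` is a hypothetical helper and is not taken from vn_mmap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the range [off, off + size) fits in a signed 64-bit
 * offset.  The sum off + size is never computed, so the check itself
 * cannot overflow.
 */
static bool
range_fits(int64_t off, uint64_t size)
{
	if (off < 0)
		return false;
	/* off >= 0, so INT64_MAX - off lies in [0, INT64_MAX]. */
	return size <= (uint64_t)(INT64_MAX - off);
}
```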


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com
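
Signed overflow is undefined behavior in C, so an offset computation has to be checked before the addition is performed. A minimal sketch of such a check follows; `offset_add` is a hypothetical helper, not the code in vn_seek.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute base + delta into *result without ever overflowing.
 * Returns false, leaving *result untouched, if the sum does not fit.
 */
static bool
offset_add(int64_t base, int64_t delta, int64_t *result)
{
	if (delta > 0 && base > INT64_MAX - delta)
		return false;	/* would exceed INT64_MAX */
	if (delta < 0 && base < INT64_MIN - delta)
		return false;	/* would fall below INT64_MIN */
	*result = base + delta;
	return true;
}
```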


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE() to use a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.
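
The interest tracking described in this entry can be pictured as a bitmask that is consulted before any expensive work is done. This is a user-space sketch under stated assumptions: the `EV_*` bits, `struct watched`, and both functions are hypothetical, and the real NOTE_* values and klist handling are not reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event bits standing in for NOTE_DELETE and NOTE_WRITE. */
#define EV_DELETE	0x01u
#define EV_WRITE	0x02u

struct watched {
	uint32_t interest;	/* union of events any watcher asked for */
	unsigned delivered;	/* notifications actually sent */
};

/* Record that some watcher wants the given events. */
static void
watch_add(struct watched *w, uint32_t events)
{
	w->interest |= events;
}

/* Returns true if the (modelled) slow path was taken. */
static bool
watch_notify(struct watched *w, uint32_t event)
{
	if ((w->interest & event) == 0)
		return false;	/* nobody cares: skip locks and list walk */
	w->delivered++;
	return true;
}
```

If the only watcher registered for EV_DELETE, a write notification returns early in this model without touching the delivery path.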


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
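
The three-step semantics above can be sketched in user space as follows. This is a model under assumptions, not the kernel implementation: `struct file_model`, `model_seek`, and `MODEL_UPDATE_OFFSET` (a stand-in for FOF_UPDATE_OFFSET) are hypothetical, SEEK_SET is taken to use delta as is, and the object lock that must cover access to the offset is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define MODEL_UPDATE_OFFSET	0x1	/* stand-in for FOF_UPDATE_OFFSET */

struct file_model {
	int64_t f_offset;	/* current offset */
	int64_t f_size;		/* end of file, used for SEEK_END */
};

static int
model_seek(struct file_model *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* Step 1: pick the base that delta is applied to. */
	switch (whence) {
	case SEEK_SET:	base = 0;		break;
	case SEEK_CUR:	base = fp->f_offset;	break;
	case SEEK_END:	base = fp->f_size;	break;
	default:	return EINVAL;
	}
	if ((delta > 0 && base > INT64_MAX - delta) ||
	    (delta < 0 && base < INT64_MIN - delta))
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;
	/* Step 3: commit it only when requested. */
	if (flags & MODEL_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Passing flags of 0 lets a caller compute a prospective offset without moving the file position; that usage is an illustration, not something the entry states.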


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
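
The return convention described in this entry (a vnode or a file descriptor on success, EOPNOTSUPP for callers that pass NULL for the extra return values) can be modelled roughly as below. Everything here is a hypothetical user-space sketch: `struct open_outcome` and `model_open` are invented for illustration, the parameter names `ret_domove` and `ret_fd` are borrowed from later entries in this log, and the real vn_open signature is not reproduced.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct vnode_model { int id; };

/* What the lower layers produced: a vnode, or a descriptor (fd >= 0). */
struct open_outcome {
	struct vnode_model *vp;
	int fd;
	bool domove;	/* true models EMOVEFD, false models EDUPFD */
};

static int
model_open(const struct open_outcome *o, struct vnode_model **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	if (o->fd >= 0) {
		/* Caller is not prepared to receive a descriptor. */
		if (ret_domove == NULL || ret_fd == NULL)
			return EOPNOTSUPP;
		*ret_vp = NULL;
		*ret_domove = o->domove;
		*ret_fd = o->fd;
		return 0;
	}
	/* Regular case: only the vnode is returned. */
	*ret_vp = o->vp;
	return 0;
}
```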


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
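
An accessor of this kind can be sketched with C11 atomics. This is an illustration only: `struct vnode_model` and `model_vrefcnt` are hypothetical, and C11's `atomic_load_explicit` with `memory_order_relaxed` is used here in place of the kernel's own atomic_load_relaxed().

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct vnode; only the use count matters. */
struct vnode_model {
	atomic_uint v_usecount;
};

/*
 * Accessor in the spirit of vrefcnt(): callers read the count through
 * the function and no longer touch the field directly.
 */
static unsigned
model_vrefcnt(struct vnode_model *vp)
{
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```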


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
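
The truncation hazard behind the rename can be shown in a few lines. In this sketch `uimin_model` and `min_u64` are hypothetical names; `uimin_model` only mirrors the unsigned int shape the entry describes, and the example assumes a 32-bit unsigned int.

```c
#include <stdint.h>

/* Same shape as the function described above: arguments are unsigned int. */
static unsigned int
uimin_model(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Width-preserving alternative for 64-bit operands. */
static uint64_t
min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}
```

Passing 0x100000001 and 2 to `uimin_model` converts the first argument to 1 before the comparison, so the result is 1 instead of 2, while `min_u64` returns 2. A name such as uimin at least makes the operand type visible at the call site.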


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
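
The pattern behind this fix is to clear the whole structure, padding included, before assigning members, so that bytes the assignments never touch do not carry stale memory out to the caller. A minimal sketch with a hypothetical `struct rec_model` (not struct stat) follows; how much padding exists depends on the ABI.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record with padding after 'tag' on most ABIs. */
struct rec_model {
	char tag;
	long value;
};

/* Fill *out so that no stale bytes survive, padding included. */
static void
fill_rec(struct rec_model *out, char tag, long value)
{
	memset(out, 0, sizeof(*out));	/* clears members and padding */
	out->tag = tag;
	out->value = value;
}
```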


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers that lacked it and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root; if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
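
The two-stage lookup described in this log entry (try the emulation root
first, then retry from the real root) can be sketched in userland as
follows. This is only a sketch under stated assumptions: known_paths,
path_exists() and emul_resolve() are hypothetical names for illustration
and are not part of namei().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a file system: a fixed list of paths. */
static const char *const known_paths[] = {
	"/emul/linux/lib/libc.so",
	"/lib/libm.so",
};

/* Hypothetical helper: does the path exist in the toy file system? */
bool
path_exists(const char *path)
{
	size_t i;

	for (i = 0; i < sizeof(known_paths) / sizeof(known_paths[0]); i++)
		if (strcmp(known_paths[i], path) == 0)
			return true;
	return false;
}

/*
 * Resolve an absolute path: search under the emulation root first and,
 * if that fails, retry from the real root. The resolved path is copied
 * to out. Returns false if neither lookup succeeds.
 */
bool
emul_resolve(const char *emul_root, const char *path, char *out, size_t outlen)
{
	assert(path[0] == '/');		/* only absolute paths are retried */
	if (emul_root != NULL) {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);

		if (n > 0 && (size_t)n < outlen && path_exists(out))
			return true;
	}
	if (strlen(path) < outlen && path_exists(path)) {
		strcpy(out, path);
		return true;
	}
	return false;
}
```

With an emulation root of "/emul/linux", a lookup of "/lib/libc.so" is
satisfied inside the emulation root, while "/lib/libm.so" falls back to
the real root.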


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
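
The strict levels listed in this log entry are cumulative, so each policy
question reduces to a comparison against one level. A minimal sketch of
that idea, assuming hypothetical names (the enum and the three functions
below are illustrative, not the identifiers used in the tree):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the real kernel identifiers may differ. */
enum strict_level {
	STRICT_LEARNING = 0,	/* warnings only */
	STRICT_IDS      = 1,	/* deny access on fingerprint mismatch */
	STRICT_IPS      = 2,	/* also protect monitored files */
	STRICT_LOCKDOWN = 3	/* also deny anything not monitored */
};

/* May a file whose fingerprint does not match be accessed? */
bool
allow_mismatched_access(enum strict_level lvl)
{
	assert(lvl >= STRICT_LEARNING && lvl <= STRICT_LOCKDOWN);
	return lvl < STRICT_IDS;
}

/* May a monitored file be written to or removed? */
bool
allow_modify_monitored(enum strict_level lvl)
{
	assert(lvl >= STRICT_LEARNING && lvl <= STRICT_LOCKDOWN);
	return lvl < STRICT_IPS;
}

/* May a non-monitored file be accessed, or a new file be created? */
bool
allow_unmonitored(enum strict_level lvl)
{
	assert(lvl >= STRICT_LEARNING && lvl <= STRICT_LOCKDOWN);
	return lvl < STRICT_LOCKDOWN;
}
```

In learning mode every check passes (the real code would log a warning
instead of denying); each higher level turns one more check into a denial.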


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
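
The relationship between the two values can be sketched with a toy send
queue. This is an illustration only: struct sendq and both functions are
hypothetical and do not reflect the kernel's socket buffer layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a send queue; not the kernel's own structure. */
struct sendq {
	size_t used;		/* bytes currently queued for transmission */
	size_t capacity;	/* maximum number of bytes the queue holds */
};

/* What FIONWRITE reports: the number of bytes in the send queue. */
size_t
sendq_nwrite(const struct sendq *q)
{
	return q->used;
}

/* What FIONSPACE reports: the number of free bytes in the send queue. */
size_t
sendq_nspace(const struct sendq *q)
{
	assert(q->used <= q->capacity);
	return q->capacity - q->used;
}
```

A caller that only has FIONWRITE would also need the queue size to work
out the free space; FIONSPACE returns the difference directly.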


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
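
The pass-through convention can be sketched as a chain of handlers. The
numeric values -3 and -4 come from the log message above; everything else
(the K_ prefixes, the command numbers, the two layers, and the final
conversion to ENOTTY at the top level) is an assumption made for
illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Values as stated in the log message; prefixed so they cannot clash
 * with definitions in the host's <errno.h>.
 */
#define K_ERESTART	(-3)
#define K_EPASSTHROUGH	(-4)

/* Hypothetical command numbers, for illustration only. */
#define CMD_UPPER	1
#define CMD_LOWER	2

/* Upper layer: handles CMD_UPPER, passes everything else through. */
int
upper_ioctl(int cmd)
{
	return cmd == CMD_UPPER ? 0 : K_EPASSTHROUGH;
}

/* Lower layer: handles CMD_LOWER, passes everything else through. */
int
lower_ioctl(int cmd)
{
	return cmd == CMD_LOWER ? 0 : K_EPASSTHROUGH;
}

/*
 * Top-level dispatch: try each layer in turn. If no layer claims the
 * command, report ENOTTY to the caller (assumed behaviour).
 */
int
dispatch_ioctl(int cmd)
{
	int error = upper_ioctl(cmd);

	if (error == K_EPASSTHROUGH)
		error = lower_ioctl(cmd);
	if (error == K_EPASSTHROUGH)
		error = ENOTTY;
	assert(error != -1);	/* -1 must no longer be returned */
	return error;
}
```

Because "unhandled" has its own value, it can no longer be mistaken for a
restart request, which is the confusion the log message describes.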


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
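
The balancing rule can be sketched as two small checks. This is a model
under stated assumptions: struct usage, mins_valid() and may_reclaim()
are hypothetical names, and the real pagedaemon logic is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one usage type (anon, file or exec). */
struct usage {
	unsigned long pages;	/* pages currently in use for this type */
	unsigned min_pct;	/* reserved share of pageable memory, in % */
};

/* The three minimums together may not exceed 95%. */
bool
mins_valid(unsigned anon_min, unsigned file_min, unsigned exec_min)
{
	return anon_min + file_min + exec_min <= 95;
}

/*
 * May the pagedaemon reuse one page of this type? Only if doing so
 * does not take the count below the reserved minimum.
 */
bool
may_reclaim(const struct usage *u, unsigned long pageable)
{
	unsigned long reserved = pageable * u->min_pct / 100;

	assert(u->min_pct <= 95);
	return u->pages > reserved;
}
```

For example, with 1000 pageable pages and a 10% minimum, a type holding
101 pages may give one up, while a type holding exactly 100 may not.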


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.
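
A small sketch of why the intermediate type matters, assuming a mode type wider than 16 bits. The typedef and function names are hypothetical stand-ins, not the vn_stat code.

```c
#include <stdint.h>

/* Stand-in for a 32-bit mode_t; the real width is platform-defined. */
typedef uint32_t sketch_mode_t;

/* Problem pattern: a 16-bit temporary drops any bits above bit 15. */
sketch_mode_t
sketch_copy_mode_narrow(sketch_mode_t va_mode)
{
	uint16_t tmp = (uint16_t)va_mode;

	return tmp;
}

/* Fixed pattern: keep the value in the mode type end to end. */
sketch_mode_t
sketch_copy_mode_wide(sketch_mode_t va_mode)
{
	sketch_mode_t tmp = va_mode;

	return tmp;
}
```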


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
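
The two rules above might be sketched as below. The flag values and names are illustrative placeholders, not the kernel's constants, and modelling the read-side request as "file sync and read sync both set" is an assumption based on how POSIX combines O_RSYNC with O_SYNC.

```c
/* Illustrative flag values only; the kernel's real constants differ. */
#define SK_FSYNC	0x01	/* file integrity completion (O_SYNC-style) */
#define SK_DSYNC	0x02	/* data integrity completion (O_DSYNC-style) */
#define SK_RSYNC	0x04	/* extend the setting to reads (O_RSYNC-style) */

#define SK_IO_SYNC	0x10	/* stand-in for IO_SYNC */
#define SK_IO_DSYNC	0x20	/* stand-in for IO_DSYNC */

/* Read path: IO_SYNC when file integrity completion of reads was requested. */
int
sketch_read_ioflag(int fflag)
{
	int ioflag = 0;

	if ((fflag & (SK_FSYNC | SK_RSYNC)) == (SK_FSYNC | SK_RSYNC))
		ioflag |= SK_IO_SYNC;
	return ioflag;
}

/* Write path: IO_DSYNC when data integrity completion of writes was requested. */
int
sketch_write_ioflag(int fflag)
{
	int ioflag = 0;

	if (fflag & SK_DSYNC)
		ioflag |= SK_IO_DSYNC;
	return ioflag;
}
```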


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
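
A rough sketch of the calling convention this describes: the offset is passed by reference, and a flag decides whether it is advanced. The flag name, the in-memory "file" type, and the function are hypothetical and exist only to make the example runnable in userland.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: advance *offset after the transfer. */
#define SK_UPDATE_OFFSET 0x01

/* Hypothetical in-memory "file" used only for this sketch. */
struct sk_file {
	const char *data;
	size_t size;
};

/*
 * Read up to len bytes starting at *offset.  The offset is advanced
 * only when SK_UPDATE_OFFSET is set in flags.
 */
size_t
sk_read(const struct sk_file *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	size_t start, n;

	if (*offset < 0 || (size_t)*offset >= fp->size)
		return 0;
	start = (size_t)*offset;
	n = fp->size - start;
	if (n > len)
		n = len;
	memcpy(buf, fp->data + start, n);
	if (flags & SK_UPDATE_OFFSET)
		*offset += (off_t)n;
	return n;
}
```

One design consequence is that a positional read can pass a private offset without the flag, while an ordinary sequential read can pass the file's own offset with the flag set.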


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)
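
From the caller's side, the point about the result type looks roughly like this. The wrapper name is hypothetical; the sketch assumes a platform where <sys/ioctl.h> provides FIONREAD.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the number of bytes available to read on fd, or -1 on error.
 * The ioctl argument is a pointer to int, as the log entry states.
 */
int
sk_bytes_readable(int fd)
{
	int n = 0;

	if (ioctl(fd, FIONREAD, &n) == -1)
		return -1;
	return n;
}
```

Passing a pointer to a wider type such as off_t here would leave part of the variable unwritten, which is the mismatch the entry corrects on the kernel side.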


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.241 22-Apr-2023 riastradh

file(9): New fo_posix_fadvise operation.

XXX kernel revbump -- changes struct fileops API and ABI


# 1.240 22-Apr-2023 riastradh

file(9): New fo_fpathconf operation.

XXX kernel revbump -- struct fileops API and ABI change


# 1.239 22-Apr-2023 riastradh

file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change


# 1.238 22-Apr-2023 riastradh

readdir(2), lseek(2): Fix races in access to struct file::f_offset.

For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock

For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.

This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).

There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)
Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but ont necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.


# 1.237 13-Mar-2023 riastradh

vn_open(9): Add assertion that vp is locked on return.

Null out vp internally out of paranoia so we'll crash in evaluating
the assertion if we ever reach it via one of the vput paths.


# 1.236 13-Mar-2023 riastradh

vn_open(9): Clarify that this returns a locked vnode.

Comment only, no functional change intended.


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.235 06-Aug-2022 riastradh

vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), uses the a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_CUR, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) eroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to maniupulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Adresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyers patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
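
The expected behaviour can be checked from userland with a sketch like the one below. The helper name offset_after_empty_append() and the temporary file template are hypothetical; the sketch assumes a POSIX system where a zero-length write to a regular file has no effect other than returning 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a 5-byte file, reopen it with O_APPEND
 * (the new descriptor starts at offset 0), write zero bytes, and report
 * the resulting file offset. Returns -1 if any step fails.
 */
off_t
offset_after_empty_append(void)
{
	char path[] = "/tmp/append-demo-XXXXXX";
	off_t off = -1;
	int fd, afd;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	if (write(fd, "hello", 5) == 5) {
		afd = open(path, O_WRONLY | O_APPEND);
		if (afd != -1) {
			/* A zero-length write must not move the offset to EOF. */
			if (write(afd, "", 0) == 0)
				off = lseek(afd, 0, SEEK_CUR);
			close(afd);
		}
	}
	close(fd);
	unlink(path);
	assert(off == -1 || off >= 0);
	return off;
}
```

With the fix in place the reported offset should stay at 0 instead of jumping to the end of the file (5 in this sketch).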


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() with veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
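
The strict-level policy listed above can be summarised in a small table-like sketch. All identifiers below (the vx_ prefix, the event names) are hypothetical and chosen for illustration; they are not the names used in the veriexec sources. The sketch assumes each higher level inherits every denial of the lower ones, as the list states, and that level 0 only warns.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels described above. */
enum vx_level {
	VX_LEARNING = 0,	/* warnings only */
	VX_IDS = 1,
	VX_IPS = 2,
	VX_LOCKDOWN = 3
};

/* Hypothetical event classes taken from the bullet list. */
enum vx_event {
	VX_FP_MISMATCH,		/* fingerprint mismatch on access */
	VX_TABLE_MODIFY,	/* modification of veriexec tables */
	VX_WRITE_MONITORED,	/* write access to a monitored file */
	VX_REMOVE_MONITORED,	/* removal of a monitored file */
	VX_WRONG_ACCESS_TYPE,	/* 'direct'/'indirect'/'file' violated */
	VX_CREATE_FILE,		/* creation of a new file */
	VX_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which the event is denied rather than logged. */
static enum vx_level
vx_min_level(enum vx_event ev)
{
	switch (ev) {
	case VX_FP_MISMATCH:
	case VX_TABLE_MODIFY:
		return VX_IDS;
	case VX_WRITE_MONITORED:
	case VX_REMOVE_MONITORED:
	case VX_WRONG_ACCESS_TYPE:
		return VX_IPS;
	case VX_CREATE_FILE:
	case VX_ACCESS_UNMONITORED:
	default:
		return VX_LOCKDOWN;
	}
}

/* True if `ev` is denied at `level`; false means warn at most. */
bool
vx_denied(enum vx_level level, enum vx_event ev)
{
	assert(level >= VX_LEARNING && level <= VX_LOCKDOWN);
	return level >= vx_min_level(ev);
}
```

Encoding the policy as "minimum denying level per event" keeps the inheritance between levels implicit in a single comparison.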


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
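
The pass-through convention can be sketched as a two-layer dispatcher. Everything prefixed DEMO_ or demo_ is hypothetical; only the value -4 for the pass-through code comes from the commit message. The sketch assumes the outermost layer converts a leftover pass-through code into ENOTTY so that no negative value reaches the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_EPASSTHROUGH (-4)	/* "command not handled at this layer" */

#define DEMO_CMD_GET	1UL	/* hypothetical: handled by the driver */
#define DEMO_CMD_ZERO	2UL	/* hypothetical: handled generically */

/* Driver layer: handles only the command it knows about. */
static int
demo_driver_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case DEMO_CMD_GET:
		*data = 42;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;	/* let another layer try */
	}
}

/* Generic layer: fallback for commands the driver passed through. */
static int
demo_generic_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case DEMO_CMD_ZERO:
		*data = 0;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;
	}
}

/* Top level: try each layer, then map "unhandled" to ENOTTY. */
int
demo_ioctl(unsigned long cmd, int *data)
{
	int error;

	assert(data != NULL);
	error = demo_driver_ioctl(cmd, data);
	if (error == DEMO_EPASSTHROUGH)
		error = demo_generic_ioctl(cmd, data);
	if (error == DEMO_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Using a dedicated code for "unhandled" avoids overloading -1 or ENOTTY, which is the confusion the commit describes.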


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.241 22-Apr-2023 riastradh

file(9): New fo_posix_fadvise operation.

XXX kernel revbump -- changes struct fileops API and ABI


# 1.240 22-Apr-2023 riastradh

file(9): New fo_fpathconf operation.

XXX kernel revbump -- struct fileops API and ABI change


# 1.239 22-Apr-2023 riastradh

file(9): New fo_advlock operation.

This moves the vnode-specific logic from sys_descrip.c into
vfs_vnode.c, like we did for fo_seek.

XXX kernel revbump -- struct fileops API and ABI change


# 1.238 22-Apr-2023 riastradh

readdir(2), lseek(2): Fix races in access to struct file::f_offset.

For non-directory vnodes:
- reading f_offset requires a shared or exclusive vnode lock
- writing f_offset requires an exclusive vnode lock

For directory vnodes, access (read or write) requires either:
- a shared vnode lock AND f_lock, or
- an exclusive vnode lock.
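
The rules above can be restated as a small predicate. This is only a sketch in plain C under the assumption that the caller already knows which locks it holds; the enum, the function name, and its parameters are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical description of how the caller holds the vnode lock. */
enum sketch_vlock { SKETCH_UNLOCKED, SKETCH_SHARED, SKETCH_EXCLUSIVE };

/*
 * Return true if an access to f_offset is permitted under the rules
 * stated in the log message:
 *  - an exclusive vnode lock always suffices;
 *  - for non-directories, a shared lock suffices for reads only;
 *  - for directories, a shared lock suffices (read or write) only
 *    when f_lock is also held.
 */
static bool
sketch_offset_access_ok(bool is_dir, bool is_write,
    enum sketch_vlock vlock, bool holds_f_lock)
{
	if (vlock == SKETCH_EXCLUSIVE)
		return true;
	if (vlock == SKETCH_UNLOCKED)
		return false;
	/* Shared vnode lock from here on. */
	if (is_dir)
		return holds_f_lock;
	return !is_write;
}
```

For example, a directory read under a shared vnode lock without f_lock is rejected, which is the case the directory rule exists to exclude.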

This way, two files for the same underlying directory vnode can still
do VOP_READDIR in parallel, but if two readdir(2) or lseek(2) calls
run in parallel on the same file, the load and store of f_offset is
atomic (otherwise, e.g., on 32-bit systems it might be torn and lead
to corrupt offsets).

There is still a potential problem: the _whole transaction_ of
readdir(2) may not be atomic. For example, if thread A and thread B
read n bytes of directory content, thread A might get bytes [0,n) and
thread B might get bytes [n,2n) but f_offset might end up at n
instead of 2n once both operations complete. (However, f_offset
wouldn't be some corrupt garbled number like n & 0xffffffff00000000.)
Fixing this would require either:
(a) using an exclusive vnode lock in vn_readdir,
(b) introducing a new lock that serializes vn_readdir on the same
file (but not necessarily the same vnode), or
(c) proving it is safe to hold f_lock across VOP_READDIR, VOP_SEEK,
and VOP_GETATTR.


# 1.237 13-Mar-2023 riastradh

vn_open(9): Add assertion that vp is locked on return.

Null out vp internally out of paranoia so we'll crash in evaluating
the assertion if we ever reach it via one of the vput paths.


# 1.236 13-Mar-2023 riastradh

vn_open(9): Clarify that this returns a locked vnode.

Comment only, no functional change intended.


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.235 06-Aug-2022 riastradh

vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(); it uses a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.
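
A userland sketch of the three steps listed above, assuming a toy file object with a known size. `struct sketch_file`, `sketch_seek`, and `SKETCH_FOF_UPDATE_OFFSET` are hypothetical stand-ins for the kernel's struct file, an fo_seek method, and FOF_UPDATE_OFFSET. The locking that the log message requires is not modeled here, and the error codes are a plausible choice rather than the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define SKETCH_FOF_UPDATE_OFFSET 0x01	/* hypothetical flag value */

/* Toy file object: current offset plus the size of the backing object. */
struct sketch_file {
	int64_t f_offset;
	int64_t f_size;
};

static int
sketch_seek(struct sketch_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	assert(fp != NULL);

	/* Step 1: pick the base that delta is applied to. */
	switch (whence) {
	case SEEK_SET: base = 0; break;
	case SEEK_CUR: base = fp->f_offset; break;
	case SEEK_END: base = fp->f_size; break;
	default: return EINVAL;
	}
	if (base < 0)
		return EINVAL;
	/* Reject the addition before it can overflow. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;
	/* Step 3: commit the new offset only when requested. */
	if (flags & SKETCH_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Calling it with flags of 0 computes and returns an offset without moving the file position, which is the shape a pread-style caller would want.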

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.
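
To make the distinction concrete, here is a small sketch contrasting the two behaviors. It assumes `unsigned int` is 32 bits wide and uses `uint32_t` to make that explicit; all `sketch_` and `SKETCH_` names are hypothetical, and the explicit casts stand in for the implicit conversion a compiler would perform silently at the call site.

```c
#include <stdint.h>

/* Function form: arguments are narrowed to 32 bits before comparing. */
static uint32_t
sketch_uimin(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Macro form: compares at the operands' own width, but evaluates twice. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamp a 64-bit length through the 32-bit function: high bits are lost. */
static uint64_t
sketch_clamp_truncating(uint64_t len, uint64_t limit)
{
	return sketch_uimin((uint32_t)len, (uint32_t)limit);
}

/* Clamp a 64-bit length through the macro: no truncation occurs. */
static uint64_t
sketch_clamp_wide(uint64_t len, uint64_t limit)
{
	return SKETCH_MIN(len, limit);
}
```

With a length of 2^32 + 1 and a limit of 4096, the truncating version compares 1 against 4096 and returns 1, while the macro version returns 4096.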

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int
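
As an illustration of the hazard only (the change itself is not shown in this log), a size_t value larger than INT_MAX does not survive conversion to int. A minimal sketch of a checked conversion, with the hypothetical helper name size_to_int_checked():

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Convert n to int only when the value is representable.
 * Returns false and leaves *out untouched otherwise.
 */
static bool
size_to_int_checked(size_t n, int *out)
{
    if (n > (size_t)INT_MAX)
        return false;
    *out = (int)n;
    return true;
}
```

Keeping the variable as size_t throughout avoids the conversion entirely and is usually the simpler fix.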


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
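
The pattern described here, clearing the whole structure before assigning members so that padding bytes carry no stale memory, might look like this in miniature. model_stat and model_stat_fill() are hypothetical; the member layout is chosen only so that typical ABIs insert padding.

```c
#include <stdint.h>
#include <string.h>

/* On common ABIs there is padding after st_mode and after st_nlink. */
struct model_stat {
    uint32_t st_mode;
    uint64_t st_dev;
    uint16_t st_nlink;
};

static void
model_stat_fill(struct model_stat *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
    /* Zero everything first, padding included. */
    memset(sb, 0, sizeof(*sb));
    sb->st_mode = mode;
    sb->st_dev = dev;
    sb->st_nlink = nlink;
}
```

Without the memset(), whatever bytes previously occupied the padding would be copied out to the caller along with the real fields.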


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
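
A size-limit check performed before a write is handed to the file system can be sketched as below. This is a deliberately plain model: model_fsize_check() is a hypothetical name, and it does not reproduce the particular semantics the commit says were maintained.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if writing resid bytes at offset stays within limit,
 * EFBIG otherwise.
 */
static int
model_fsize_check(uint64_t offset, uint64_t resid, uint64_t limit)
{
    if (resid > UINT64_MAX - offset)    /* offset + resid would overflow */
        return EFBIG;
    if (offset + resid > limit)
        return EFBIG;
    return 0;
}
```

Doing the check once in the common write path means individual file system drivers do not each need their own copy.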


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
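
The lookup order described above (emulation root first for absolute paths, then the normal root) can be modelled with plain strings. This is a sketch only: model_namei(), model_demo_exists() and MODEL_TRYEMULROOT are hypothetical, the real code works on vnodes rather than path strings, and the ".." and symlink rules are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_TRYEMULROOT 0x1   /* hypothetical flag value */

typedef bool (*model_exists_fn)(const char *path);

/* Tiny fake file system used for demonstration. */
static bool
model_demo_exists(const char *path)
{
    return strcmp(path, "/emul/linux/usr/lib/libc.so") == 0 ||
        strcmp(path, "/etc/passwd") == 0;
}

/* Resolve path into out. Returns 0 on success, -1 if nothing was found. */
static int
model_namei(const char *emul_root, const char *path, int flags,
    model_exists_fn exists, char *out, size_t outlen)
{
    int n;

    /* First attempt: absolute path looked up inside the emulation root. */
    if (emul_root != NULL && (flags & MODEL_TRYEMULROOT) != 0 &&
        path[0] == '/') {
        n = snprintf(out, outlen, "%s%s", emul_root, path);
        if (n > 0 && (size_t)n < outlen && exists(out))
            return 0;
    }

    /* Fallback: retry from the normal root. */
    n = snprintf(out, outlen, "%s", path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return exists(out) ? 0 : -1;
}
```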


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
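
Capping a caller-supplied size before allocating a kernel buffer amounts to a simple clamp. In this sketch MODEL_MAXBSIZE and model_readdir_bufsize() are hypothetical, and the 64 KiB value is an assumption for illustration, not taken from this log.

```c
#include <stddef.h>

#define MODEL_MAXBSIZE ((size_t)64 * 1024)  /* assumed value */

/* Never allocate more than MODEL_MAXBSIZE, whatever the caller asked for. */
static size_t
model_readdir_bufsize(size_t requested)
{
    return requested < MODEL_MAXBSIZE ? requested : MODEL_MAXBSIZE;
}
```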


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
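
The cumulative nature of the four strict levels listed above can be expressed as threshold checks. This is a sketch based only on that list: the enum and the predicate names are hypothetical and do not claim to match the kernel's identifiers.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels in the list above. */
enum model_strict {
    MODEL_LEARNING = 0,     /* warnings only */
    MODEL_IDS      = 1,
    MODEL_IPS      = 2,
    MODEL_LOCKDOWN = 3
};

/* Level 1 and up: deny access on fingerprint mismatch. */
static bool
model_deny_on_mismatch(enum model_strict level)
{
    return level >= MODEL_IDS;
}

/* Level 2 and up: deny write access to monitored files. */
static bool
model_deny_write_monitored(enum model_strict level)
{
    return level >= MODEL_IPS;
}

/* Level 3: prevent creation of new files. */
static bool
model_deny_create(enum model_strict level)
{
    return level >= MODEL_LOCKDOWN;
}

/* Level 3: deny access to non-monitored files. */
static bool
model_deny_unmonitored(enum model_strict level)
{
    return level >= MODEL_LOCKDOWN;
}
```

Because each level implies the ones below it, a single ordered comparison per policy is enough.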


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
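
The relationship between the two values reported by these ioctls can be shown with a toy send queue. model_sendq and both helper functions are hypothetical; the real ioctls report their result through the ioctl argument, which this sketch replaces with a return value.

```c
#include <stddef.h>

/* Toy send queue; sq_used is assumed never to exceed sq_size. */
struct model_sendq {
    size_t sq_size;     /* total capacity in bytes */
    size_t sq_used;     /* bytes currently queued */
};

/* What FIONWRITE reports in this model: bytes in the send queue. */
static size_t
model_fionwrite(const struct model_sendq *q)
{
    return q->sq_used;
}

/* What FIONSPACE reports in this model: free bytes in the send queue. */
static size_t
model_fionspace(const struct model_sendq *q)
{
    return q->sq_size - q->sq_used;
}
```

An application that only needs the free space can use the second value directly and never has to learn the queue size.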


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved by thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
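
A minimal sketch of the pass-through convention described above, not the
kernel's actual code. The handler names and command numbers are
hypothetical; the value -4 for EPASSTHROUGH comes from the commit message,
and the final translation to ENOTTY at the top level is an assumption about
how an unclaimed command would be reported to userland.

```c
#include <errno.h>

/* Value taken from the commit message; ERESTART sits at -3 next to it. */
#define TOY_EPASSTHROUGH (-4)

/* Hypothetical command numbers, for illustration only. */
#define TOY_CMD_A 1
#define TOY_CMD_B 2

/* Upper layer: handles TOY_CMD_A, passes everything else through. */
static int
toy_upper_ioctl(int cmd, int *out)
{
	if (cmd == TOY_CMD_A) {
		*out = 10;
		return 0;
	}
	return TOY_EPASSTHROUGH;	/* "not mine", distinct from any real error */
}

/* Lower layer: handles TOY_CMD_B, passes everything else through. */
static int
toy_lower_ioctl(int cmd, int *out)
{
	if (cmd == TOY_CMD_B) {
		*out = 20;
		return 0;
	}
	return TOY_EPASSTHROUGH;
}

/* Top level: try each layer, then map an unclaimed command to ENOTTY. */
static int
toy_ioctl(int cmd, int *out)
{
	int error;

	error = toy_upper_ioctl(cmd, out);
	if (error == TOY_EPASSTHROUGH)
		error = toy_lower_ioctl(cmd, out);
	if (error == TOY_EPASSTHROUGH)
		error = ENOTTY;		/* assumed top-level translation */
	return error;
}
```

Because the sentinel is no longer -1, a handler returning it cannot be
mistaken for a request to restart the system call.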


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
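
The two rules in this message (the minimums may not sum to more than 95%,
and a page is not reused if that would drop its type below the minimum) can
be sketched as follows. This is an illustration under assumptions, not the
UVM code: the struct, function names, and the exact comparison are
hypothetical; only the tunable names and the 95% limit come from the text.

```c
#include <errno.h>

/* Percentages of pageable memory reserved per usage type. */
struct toy_mins {
	int anonmin;	/* vm.anonmin  */
	int vnodemin;	/* vm.vnodemin */
	int vtextmin;	/* vm.vtextmin */
};

#define TOY_MIN_SUM_LIMIT 95	/* from the commit message */

/* Accept new minimums only if each is a percentage and the sum is <= 95. */
static int
toy_set_mins(struct toy_mins *m, int anonmin, int vnodemin, int vtextmin)
{
	if (anonmin < 0 || anonmin > 100 ||
	    vnodemin < 0 || vnodemin > 100 ||
	    vtextmin < 0 || vtextmin > 100)
		return EINVAL;
	if (anonmin + vnodemin + vtextmin > TOY_MIN_SUM_LIMIT)
		return EINVAL;
	m->anonmin = anonmin;
	m->vnodemin = vnodemin;
	m->vtextmin = vtextmin;
	return 0;
}

/*
 * Return nonzero if taking one page of this type would still leave the
 * type at or above minpct percent of pageable memory.
 */
static int
toy_may_reuse(long npages_of_type, long npageable, int minpct)
{
	return (npages_of_type - 1) * 100 >= npageable * (long)minpct;
}
```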


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
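
One plausible reading of the two rules above, as a sketch. All flag names
and bit values here are hypothetical stand-ins, not the kernel's. The read
side sets the sync bit only when both file integrity and read
synchronization were requested; treating full file integrity on the write
side as IO_SYNC is an assumption that goes beyond what this message states.

```c
/* Hypothetical per-file flags recording what the opener asked for. */
#define TOY_FSYNC 0x1	/* file integrity completion (O_SYNC)  */
#define TOY_DSYNC 0x2	/* data integrity completion (O_DSYNC) */
#define TOY_RSYNC 0x4	/* extend the above to reads (O_RSYNC) */

/* Hypothetical ioflag bits handed to the read/write vnode operators. */
#define TOY_IO_SYNC  0x10
#define TOY_IO_DSYNC 0x20

/* ioflag for a read: sync only if file integrity applies to reads. */
static int
toy_read_ioflag(int fflag)
{
	int ioflag = 0;

	if ((fflag & (TOY_FSYNC | TOY_RSYNC)) == (TOY_FSYNC | TOY_RSYNC))
		ioflag |= TOY_IO_SYNC;
	return ioflag;
}

/* ioflag for a write: the stronger guarantee wins if both are set. */
static int
toy_write_ioflag(int fflag)
{
	int ioflag = 0;

	if (fflag & TOY_FSYNC)
		ioflag |= TOY_IO_SYNC;		/* assumption, see lead-in */
	else if (fflag & TOY_DSYNC)
		ioflag |= TOY_IO_DSYNC;
	return ioflag;
}
```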


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.237 13-Mar-2023 riastradh

vn_open(9): Add assertion that vp is locked on return.

Null out vp internally out of paranoia so we'll crash in evaluating
the assertion if we ever reach it via one of the vput paths.


# 1.236 13-Mar-2023 riastradh

vn_open(9): Clarify that this returns a locked vnode.

Comment only, no functional change intended.


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.235 06-Aug-2022 riastradh

vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(); it uses a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
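
The three-step semantics listed in this message can be sketched in userland
C as below. This is an illustration, not the kernel's vn_seek: the struct,
the function name, and the flag value are hypothetical, the end-of-file
position is a plain field rather than something fetched from the vnode, and
the object lock the message requires around f_offset is not modeled. The
overflow test is written so that the check itself cannot overflow.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define TOY_FOF_UPDATE_OFFSET 0x01	/* illustrative value only */

struct toy_file {
	int64_t f_offset;	/* current offset, assumed nonnegative */
	int64_t f_size;		/* stand-in for the end-of-file position */
};

static int
toy_seek(struct toy_file *fp, int64_t delta, int whence, int64_t *newoffp,
    int flags)
{
	int64_t base, newoff;

	/* 1. Pick the base according to whence. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}

	/* base >= 0, so only a positive delta can overflow the sum. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* 2. Report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* 3. Commit it to the file only when requested. */
	if (flags & TOY_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Keeping steps 2 and 3 separate lets a caller compute an offset without
moving the file position, or move it without caring about the result.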


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
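
A minimal sketch of the accessor idea using C11 atomics. The struct and
function name are hypothetical stand-ins; the kernel has its own
atomic_load_relaxed() and its own vnode layout, neither of which is
reproduced here.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the one vnode field that matters here. */
struct toy_vnode {
	atomic_uint v_usecount;
};

/*
 * Read the use count without imposing any ordering on surrounding
 * memory accesses. Callers get a snapshot that may be stale by the
 * time they look at it; hiding the field behind a function keeps that
 * detail in one place.
 */
static unsigned int
toy_vrefcnt(struct toy_vnode *vp)
{
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```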


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
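
The two hazards contrasted in this message can be shown with a short
sketch. The names are hypothetical (the real routines live in libkern), and
the example assumes unsigned int is 32 bits wide, as on common 64-bit
platforms.

```c
#include <stdint.h>

/* Defined on unsigned int, like the routine this commit renames to uimin. */
static unsigned int
toy_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Macro in the style quoted above: works at the operands' own width, so
 * no truncation, but each argument may be evaluated more than once.
 */
#define TOY_MIN(a, b) ((a) < (b) ? (a) : (b))

/* A 64-bit value just above 2^32; its low 32 bits are 5. */
static const uint64_t toy_big = (UINT64_C(1) << 32) + 5;
```

Passing toy_big to toy_uimin() converts it to unsigned int first, so the
comparison sees 5 rather than the full value; TOY_MIN(toy_big, 100) compares
at 64 bits and yields 100. Conversely, TOY_MIN(i++, 10) increments i twice
when the first operand is the smaller one.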


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
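
From userland, the flag might be used as sketched below. open_regular is a hypothetical helper, and the EINVAL it reports in the fallback path is an assumption of this sketch. Where the headers do not define O_REGULAR, it falls back to the O_NONBLOCK plus fstat() check that the message argues against, because by then the open has already happened:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open path read-only, but only if it is a regular file. */
int
open_regular(const char *path)
{
#ifdef O_REGULAR
	/* The kernel rejects non-regular files as part of the open itself. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	struct stat st;
	int fd, error;

	/* Fallback: the open, and any device side effect, happens first. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		error = errno;
		close(fd);
		errno = error;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* errno choice is an assumption of this sketch */
		return -1;
	}
	return fd;
#endif
}
```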


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
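
The ordering can be sketched with userland mutexes standing in for vnode locks. struct demo_node and demo_step are hypothetical names used only for illustration:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical node: a lock plus a pointer to one child. */
struct demo_node {
	pthread_mutex_t   dn_lock;
	struct demo_node *dn_child;
};

/*
 * Step from a locked parent to its child.  The child is locked while the
 * parent is still held, and only then is the parent released.
 * Returns the locked child, or NULL (parent unlocked) if there is none.
 */
struct demo_node *
demo_step(struct demo_node *parent)
{
	struct demo_node *child = parent->dn_child;

	if (child != NULL)
		pthread_mutex_lock(&child->dn_lock);	/* lock the child */
	pthread_mutex_unlock(&parent->dn_lock);		/* unlock the parent */
	return child;
}
```

The caller is expected to hold the parent's lock on entry, so at no point is the path from parent to child left unprotected.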


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.
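
The "lock, then check for dead, then unlock and fail" pattern can be modelled in userland. This is a simplified sketch, not the kernel code: struct demo_obj, demo_lock and the retry parameter (a rough stand-in for LK_RETRY) are hypothetical, and the dead flag is assumed to be written only with the lock held:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of an object that can be reclaimed while in use. */
struct demo_obj {
	pthread_mutex_t do_lock;
	bool            do_dead;	/* set, under do_lock, once reclaimed */
};

/*
 * Lock obj, then re-check liveness under the lock.  A dead object is
 * unlocked again and reported as ENOENT unless the caller passed retry.
 * Returns 0 with the lock held, or ENOENT with the lock released.
 */
int
demo_lock(struct demo_obj *obj, bool retry)
{
	pthread_mutex_lock(&obj->do_lock);
	if (obj->do_dead && !retry) {
		pthread_mutex_unlock(&obj->do_lock);
		return ENOENT;
	}
	return 0;
}
```

Checking the flag only after the lock is taken is what closes the window: a check made before locking could be stale by the time the lock is granted.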

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
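
Why zeroing matters can be shown with a small structure that has padding. A sketch under stated assumptions: struct demo_stat is hypothetical, its padding layout depends on the ABI, and the C standard leaves padding bytes unspecified after a member store, so this relies on the usual compiler behaviour of leaving them alone:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical structure with padding between its members on most ABIs. */
struct demo_stat {
	uint8_t  ds_type;	/* 1 byte, typically followed by padding */
	uint64_t ds_dev;	/* 64-bit device number */
};

/* Fill *out without leaving stale bytes behind in the padding. */
void
demo_fill(struct demo_stat *out, uint8_t type, uint64_t dev)
{
	/* Zero the whole object first, padding included. */
	memset(out, 0, sizeof(*out));
	out->ds_type = type;
	out->ds_dev = dev;
}

/* Count the non-zero bytes in the raw representation of *st. */
size_t
demo_nonzero_bytes(const struct demo_stat *st)
{
	const unsigned char *p = (const unsigned char *)st;
	size_t i, n = 0;

	for (i = 0; i < sizeof(*st); i++)
		if (p[i] != 0)
			n++;
	return n;
}
```

Without the memset, whatever was previously in the padding bytes would be copied out along with the members.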


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
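
The limit is visible from userland roughly as follows. A minimal sketch, assuming POSIX semantics for RLIMIT_FSIZE; demo_write_past_limit is a hypothetical helper, and it lowers the soft limit for the whole calling process:

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Lower the soft RLIMIT_FSIZE to limit bytes, then try to write one byte
 * at offset limit of a scratch file at path.  Returns the errno of the
 * first failing call, or 0 if everything succeeded.
 */
int
demo_write_past_limit(const char *path, off_t limit)
{
	struct rlimit rl;
	int fd, error = 0;

	/* Otherwise the default SIGXFSZ action would terminate the process. */
	signal(SIGXFSZ, SIG_IGN);

	if (getrlimit(RLIMIT_FSIZE, &rl) == -1)
		return errno;
	rl.rlim_cur = (rlim_t)limit;	/* only the soft limit is lowered */
	if (setrlimit(RLIMIT_FSIZE, &rl) == -1)
		return errno;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return errno;
	/* One byte at offset limit would make the file limit + 1 bytes long. */
	if (lseek(fd, limit, SEEK_SET) == -1 || write(fd, "x", 1) == -1)
		error = errno;
	close(fd);
	unlink(path);
	return error;
}
```

With SIGXFSZ ignored, the write is expected to fail with EFBIG rather than extend the file.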


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
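
The expected behaviour can be checked from userland. A small sketch; demo_zero_append_offset is a hypothetical helper that creates and removes a scratch file at the given path:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a scratch file with O_APPEND, rewind to offset 0, perform a
 * zero-length write, and return the file offset afterwards.
 * Returns -1 if any step fails.
 */
off_t
demo_zero_append_offset(const char *path)
{
	off_t off = -1;
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;
	/* Give the file some content so end-of-file differs from offset 0. */
	if (write(fd, "data", 4) == 4 && lseek(fd, 0, SEEK_SET) == 0) {
		/* The zero-length write should leave the offset alone. */
		if (write(fd, "", 0) == 0)
			off = lseek(fd, 0, SEEK_CUR);
	}
	close(fd);
	unlink(path);
	return off;
}
```

A return value of 0 means the offset was left untouched; a value of 4 would mean the zero-length write moved it to end-of-file.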


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() with veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
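
Because each level includes everything below it, the policy reduces to threshold comparisons. An illustrative sketch only: the enum and function names are hypothetical and do not come from the veriexec sources, and just three of the listed rules are modelled:

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels listed above. */
enum demo_strict {
	DEMO_LEARNING = 0,	/* warnings only */
	DEMO_IDS = 1,
	DEMO_IPS = 2,
	DEMO_LOCKDOWN = 3
};

/* Is access denied when the fingerprint does not match?  (level 1 and up) */
bool
demo_deny_mismatch(enum demo_strict level)
{
	return level >= DEMO_IDS;
}

/* Is write access to a monitored file denied?  (level 2 and up) */
bool
demo_deny_write_monitored(enum demo_strict level)
{
	return level >= DEMO_IPS;
}

/* Is access to a non-monitored file denied?  (level 3 only) */
bool
demo_deny_unmonitored(enum demo_strict level)
{
	return level >= DEMO_LOCKDOWN;
}
```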


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - not always the data will be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
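
As a rough illustration of the convention described above, the following
user-space C sketch models a two-layer ioctl dispatch. All MODEL_* names and
the command numbers are hypothetical; only the values -3 and -4 come from the
commit message, and the final translation of an unhandled command to ENOTTY
is an assumption about what the outermost layer reports to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values taken from the commit message; the names are stand-ins. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

/* Hypothetical command numbers for this sketch only. */
#define MODEL_CMD_UPPER		1
#define MODEL_CMD_LOWER		2

/* Upper layer: handles one command, passes everything else through. */
static int
model_upper_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case MODEL_CMD_UPPER:
		*data = 1;
		return 0;
	default:
		return MODEL_EPASSTHROUGH;	/* not -1, which was ERESTART */
	}
}

/* Lower layer: same convention. */
static int
model_lower_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case MODEL_CMD_LOWER:
		*data = 2;
		return 0;
	default:
		return MODEL_EPASSTHROUGH;
	}
}

/* Outermost dispatcher: try each layer, then report "unhandled". */
static int
model_ioctl(int cmd, int *data)
{
	int error;

	assert(data != NULL);

	error = model_upper_ioctl(cmd, data);
	if (error == MODEL_EPASSTHROUGH)
		error = model_lower_ioctl(cmd, data);
	if (error == MODEL_EPASSTHROUGH)
		error = ENOTTY;		/* assumed user-visible errno */
	return error;
}
```

Because "unhandled" has its own value, a layer that genuinely needs the call
restarted can return MODEL_ERESTART without being mistaken for a layer that
simply did not recognize the command.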


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open() which prevent symlinks from being followed
at open time. Use this to prevent coredump to follow symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.235 06-Aug-2022 riastradh

vnodeops(9): Take exclusive lock in read/seek for f_offset update.

Otherwise concurrent readers/seekers might clobber it.


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
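
The three-step semantics above can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions: struct model_file,
model_seek and MODEL_FOF_UPDATE_OFFSET are hypothetical stand-ins for the
kernel's struct file, fo_seek method and FOF_UPDATE_OFFSET, the file size is
a plain field rather than a vnode attribute, the error values are guesses,
and no locking is shown (in the kernel the whole body would run under the
object lock). SEEK_SET is treated as "use delta as is".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define MODEL_FOF_UPDATE_OFFSET	0x01	/* hypothetical flag value */

struct model_file {
	int64_t f_offset;	/* current offset, never negative */
	int64_t f_size;		/* end of file, never negative */
};

static int
model_seek(struct model_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	assert(fp != NULL);
	assert(fp->f_offset >= 0 && fp->f_size >= 0);

	/* 1. Pick the base that delta is applied to. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}

	/* Check before adding, so the check itself cannot overflow. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;	/* base >= 0, so this cannot underflow */
	if (newoff < 0)
		return EINVAL;

	/* 2. Report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* 3. Commit it to the file only when requested. */
	if (flags & MODEL_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Calling model_seek with flags 0 and a non-null newoffp computes and validates
an offset without moving the file position; passing the update flag moves it.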


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
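
The truncation hazard described above can be sketched in user space. This is an illustration only: the helper below mirrors the shape the message describes (a function defined on unsigned int), and clamp_with_uimin and clamp_with_macro are hypothetical names. It assumes unsigned int is narrower than 64 bits, as on common ABIs.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) < sizeof(uint64_t),
    "sketch assumes unsigned int is narrower than 64 bits");

/* Same shape as the renamed helper: both operands are unsigned int. */
static inline unsigned int
uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-generic macro of the kind the message says some subsystems define. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Both 64-bit arguments are silently converted to unsigned int here. */
static uint64_t
clamp_with_uimin(uint64_t len, uint64_t limit)
{
	return uimin(len, limit);
}

/* The macro compares in the operands' own type, so nothing is truncated. */
static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	return MIN(len, limit);
}
```

For values that fit in 32 bits the two agree; once either operand exceeds that, the function version compares only the low bits. The macro avoids truncation but evaluates its arguments more than once, which is the multiple-evaluation risk the message mentions.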


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
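
A rough user-space sketch of the zero-before-fill pattern described above. struct demo_stat is a hypothetical stand-in, not the real struct stat; on common 64-bit ABIs the 64-bit field after a 32-bit one leaves a padding hole that would otherwise carry stale bytes when the whole structure is copied out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a stat-like structure with a padding hole. */
struct demo_stat {
	uint32_t ds_mode;
	uint64_t ds_dev;	/* typically 8-byte aligned, leaving 4 bytes of padding */
};

/* Zero the whole structure first, then fill in the fields. */
static void
demo_fill(struct demo_stat *sb, uint32_t mode, uint64_t dev)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->ds_mode = mode;
	sb->ds_dev = dev;
}

/* Return nonzero if any byte between the two fields is nonzero. */
static int
demo_padding_dirty(const struct demo_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	size_t i;

	for (i = offsetof(struct demo_stat, ds_mode) + sizeof(sb->ds_mode);
	    i < offsetof(struct demo_stat, ds_dev); i++) {
		if (p[i] != 0)
			return 1;
	}
	return 0;
}
```

Without the memset, a structure on the stack keeps whatever bytes were there before in its padding, and copying sizeof(struct demo_stat) bytes to the caller would expose them.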


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
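
As a simplified illustration of a size-limit check performed before the write is handed to the file system: demo_check_fsize is a hypothetical name, and the sketch deliberately ignores the v_size versus va_bytes and overwrite subtleties the message refers to. It only shows an overflow-safe "would this write end past the limit" test that returns EFBIG.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return EFBIG if a write of resid bytes at offset would end beyond limit,
 * otherwise 0. Written so that offset + resid is never computed directly,
 * which avoids unsigned wraparound.
 */
static int
demo_check_fsize(uint64_t offset, uint64_t resid, uint64_t limit)
{
	if (resid > limit || offset > limit - resid)
		return EFBIG;
	return 0;
}

/* Small usage example. */
void
demo_check_fsize_usage(void)
{
	assert(demo_check_fsize(0, 10, 10) == 0);	/* ends exactly at the limit */
	assert(demo_check_fsize(1, 10, 10) == EFBIG);	/* one byte too far */
}
```

Doing the check once in the common write path, rather than in each file system, is the design point of the commit: every file system gets the same behaviour.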


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
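
The cumulative strict levels listed above can be sketched as a small policy table. This is only an illustration of the described semantics, not the veriexec implementation: the type, field, and function names (`model_policy`, `model_policy_for`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Strict levels as listed in the commit message. */
enum model_strict_level {
	MODEL_LEARNING = 0,
	MODEL_IDS = 1,
	MODEL_IPS = 2,
	MODEL_LOCKDOWN = 3
};

/* Hypothetical policy record; field names are illustrative only. */
struct model_policy {
	bool deny_fp_mismatch;		/* level 1 and up */
	bool deny_table_modification;	/* level 1 and up */
	bool deny_write_monitored;	/* level 2 and up */
	bool deny_remove_monitored;	/* level 2 and up */
	bool enforce_access_type;	/* level 2 and up */
	bool deny_file_creation;	/* level 3 */
	bool deny_non_monitored;	/* level 3 */
};

/* Each level implies everything the lower levels enforce. */
static struct model_policy
model_policy_for(enum model_strict_level level)
{
	struct model_policy p = {
		false, false, false, false, false, false, false
	};

	if (level >= MODEL_IDS) {
		p.deny_fp_mismatch = true;
		p.deny_table_modification = true;
	}
	if (level >= MODEL_IPS) {
		p.deny_write_monitored = true;
		p.deny_remove_monitored = true;
		p.enforce_access_type = true;
	}
	if (level >= MODEL_LOCKDOWN) {
		p.deny_file_creation = true;
		p.deny_non_monitored = true;
	}
	return p;
}
```

At level 0 every field stays false, which corresponds to the warnings-only learning mode.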


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
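
The relationship between the two ioctls can be sketched with a toy send queue. This is a user-space model under assumed names (`model_sendq`, `hiwat`, `cc`), not the kernel's socket code: FIONWRITE corresponds to the bytes queued, FIONSPACE to the room that remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send queue: a capacity and a current byte count. */
struct model_sendq {
	size_t hiwat;	/* capacity of the send queue, in bytes */
	size_t cc;	/* bytes currently queued */
};

/* What FIONWRITE reports: bytes sitting in the send queue. */
static size_t
model_fionwrite(const struct model_sendq *q)
{
	return q->cc;
}

/* What FIONSPACE reports: free bytes, clamped at zero. */
static size_t
model_fionspace(const struct model_sendq *q)
{
	return q->cc < q->hiwat ? q->hiwat - q->cc : 0;
}
```

A caller that only needs the free space never has to learn the queue size, which is the convenience the message describes.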


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.
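
A hook list of this kind can be sketched as a small table of function pointers that the write path walks for every buffer. The names below (`model_cow_establish`, `model_cow_disestablish`, `model_strategy`) and the fixed-size table are assumptions made for illustration; having hooks return an error code is also a choice of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_HOOKS 4

/* Hypothetical hook signature: called with each buffer being written. */
typedef int (*model_cow_hook_t)(void *cookie, const void *buf);

struct model_cow_entry {
	model_cow_hook_t func;
	void *cookie;
};

static struct model_cow_entry model_hooks[MODEL_MAX_HOOKS];

/* Register a hook in the first free slot. */
static int
model_cow_establish(model_cow_hook_t func, void *cookie)
{
	size_t i;

	for (i = 0; i < MODEL_MAX_HOOKS; i++) {
		if (model_hooks[i].func == NULL) {
			model_hooks[i].func = func;
			model_hooks[i].cookie = cookie;
			return 0;
		}
	}
	return ENOMEM;
}

/* Remove a previously registered (func, cookie) pair. */
static int
model_cow_disestablish(model_cow_hook_t func, void *cookie)
{
	size_t i;

	for (i = 0; i < MODEL_MAX_HOOKS; i++) {
		if (model_hooks[i].func == func &&
		    model_hooks[i].cookie == cookie) {
			model_hooks[i].func = NULL;
			model_hooks[i].cookie = NULL;
			return 0;
		}
	}
	return ENOENT;
}

/* Stand-in for the write path: run every hook, stop on first error. */
static int
model_strategy(const void *buf)
{
	size_t i;
	int error;

	for (i = 0; i < MODEL_MAX_HOOKS; i++) {
		if (model_hooks[i].func == NULL)
			continue;
		error = model_hooks[i].func(model_hooks[i].cookie, buf);
		if (error != 0)
			return error;
	}
	return 0;
}

/* Example hook: counts how many buffers it has seen. */
static int
model_count_hook(void *cookie, const void *buf)
{
	(void)buf;
	++*(int *)cookie;
	return 0;
}
```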

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.
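
The gating protocol can be modelled in a single thread as a flag plus a count of writers in progress. This is a deliberately simplified sketch: where the real functions would sleep, the hypothetical `model_*` functions below return EWOULDBLOCK or EBUSY instead, and the sync-to-disk step is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-mount gating state. */
struct model_mount {
	bool suspended;		/* new writers are being held off */
	unsigned writers;	/* modifying operations in progress */
};

/* Gate at the start of a modifying system call. */
static int
model_start_write(struct model_mount *mp)
{
	if (mp->suspended)
		return EWOULDBLOCK;	/* real code would sleep here */
	mp->writers++;
	return 0;
}

/* Called when the modifying system call is done. */
static void
model_finished_write(struct model_mount *mp)
{
	assert(mp->writers > 0);
	mp->writers--;
}

/* Stop new writers first, then require in-progress ones to drain. */
static int
model_write_suspend(struct model_mount *mp)
{
	mp->suspended = true;
	if (mp->writers > 0)
		return EBUSY;	/* real code would wait, then sync */
	return 0;
}

/* Let held-off writers proceed again. */
static void
model_write_resume(struct model_mount *mp)
{
	mp->suspended = false;
}
```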

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved by thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
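
The convention can be illustrated with a two-layer dispatcher. The numeric values for ERESTART and EPASSTHROUGH are the ones stated in the message; everything else (the `MODEL_` names, the two commands, the layer functions) is hypothetical, and the final translation to ENOTTY is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Values as given in the commit message; prefixed to avoid clashes. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

/* Hypothetical commands, one known to each layer. */
#define MODEL_CMD_A	1
#define MODEL_CMD_B	2

/* Upper layer: handles only MODEL_CMD_A. */
static int
model_upper_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case MODEL_CMD_A:
		*data = 7;
		return 0;
	default:
		return MODEL_EPASSTHROUGH;	/* not mine, try the next layer */
	}
}

/* Lower layer: handles only MODEL_CMD_B. */
static int
model_lower_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case MODEL_CMD_B:
		*data = 42;
		return 0;
	default:
		return MODEL_EPASSTHROUGH;
	}
}

/* Dispatcher: an unhandled command never leaks a negative errno. */
static int
model_ioctl(int cmd, int *data)
{
	int error;

	error = model_upper_ioctl(cmd, data);
	if (error == MODEL_EPASSTHROUGH)
		error = model_lower_ioctl(cmd, data);
	if (error == MODEL_EPASSTHROUGH)
		error = ENOTTY;	/* nobody recognised the command */
	return error;
}
```

Because "unhandled" has its own value, it can no longer be confused with a restart request.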


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
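
The two rules described here (a cap on the sum of the minimums, and a floor on each page type) can be sketched as follows. The tunable names come from the message; the structure, the function names, and the exact rounding are assumptions of this sketch rather than the UVM code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MIN_SUM_LIMIT 95	/* percent, per the commit message */

/* Percentages of pageable memory reserved for each usage. */
struct model_mins {
	int anonmin;
	int vnodemin;
	int vtextmin;
};

/* Reject a new anonmin if the three minimums would exceed the limit. */
static int
model_set_anonmin(struct model_mins *m, int newval)
{
	if (newval < 0 || newval > 100)
		return EINVAL;
	if (newval + m->vnodemin + m->vtextmin > MODEL_MIN_SUM_LIMIT)
		return EINVAL;
	m->anonmin = newval;
	return 0;
}

/*
 * May a page of some type be reused?  Only if taking one page away
 * keeps that type at or above its minimum share of pageable memory.
 */
static bool
model_may_reuse(uint64_t count, uint64_t pageable, int minpct)
{
	if (count == 0)
		return false;
	return (count - 1) * 100 >= pageable * (uint64_t)minpct;
}
```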


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open() which prevents symlinks from being followed
at open time. Use this to prevent coredump to follow symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
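
The effect of passing the offset by reference can be shown with a toy read routine. This is a user-space sketch under assumed names (`model_file`, `model_read`, `MODEL_FOF_UPDATE_OFFSET` and its value), not the kernel's fileops: a read(2)-style caller passes the file's own offset with the update flag, while a pread(2)-style caller passes a private offset without it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FOF_UPDATE_OFFSET 0x01	/* illustrative value only */

/* Hypothetical open file backed by an in-memory byte array. */
struct model_file {
	int64_t f_offset;	/* the file's own offset */
	const char *data;
	size_t size;
};

/* Read up to len bytes at *offset; advance *offset only if asked to. */
static size_t
model_read(struct model_file *fp, int64_t *offset, void *buf, size_t len,
    int flags)
{
	size_t avail, n;

	if (*offset < 0 || (uint64_t)*offset >= fp->size)
		return 0;
	avail = fp->size - (size_t)*offset;
	n = len < avail ? len : avail;
	memcpy(buf, fp->data + *offset, n);
	if (flags & MODEL_FOF_UPDATE_OFFSET)
		*offset += (int64_t)n;
	return n;
}
```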


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.
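
The fix amounts to one extra test when the write flags are assembled. The sketch below uses made-up bit values and `MODEL_`-prefixed names; only the logic (either the per-file flag or the per-mount flag requests a synchronous write) is taken from the message.

```c
#include <assert.h>

/* Illustrative bit values; the real constants live in kernel headers. */
#define MODEL_O_FSYNC		0x0080	/* per-file: opened with O_FSYNC */
#define MODEL_MNT_SYNCHRONOUS	0x0002	/* per-mount: mounted synchronous */
#define MODEL_IO_SYNC		0x0004	/* passed down to the write op */

/* Build the ioflag for a write from the file and mount flags. */
static int
model_write_ioflag(int fflag, int mntflag)
{
	int ioflag = 0;

	if ((fflag & MODEL_O_FSYNC) != 0 ||
	    (mntflag & MODEL_MNT_SYNCHRONOUS) != 0)
		ioflag |= MODEL_IO_SYNC;
	return ioflag;
}
```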


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.234 18-Jul-2022 thorpej

Make kqueue event status for vnodes shareable, and for stacked file systems
like nullfs, make the upper vnode share that status with the lower vnode.

And, lo, NetBSD 9.99.99.

Fixes PR kern/56713.


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.
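
An overflow check that cannot itself overflow is usually written by rearranging the comparison so that no sum is ever formed. The function below is a generic sketch of that idiom with a hypothetical name (`model_range_fits`); it is not the code from vn_mmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the range [off, off + size) fits in the non-negative
 * int64_t offset space.  The naive test "off + size < off" forms the
 * sum first, which is exactly what may overflow; here we subtract
 * from the maximum instead.
 */
static bool
model_range_fits(int64_t off, uint64_t size)
{
	if (off < 0)
		return false;
	/* INT64_MAX - off cannot overflow because off >= 0. */
	return size <= (uint64_t)(INT64_MAX - off);
}
```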


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).
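
Interest tracking of this kind can be sketched as a bitmask kept alongside the watcher list: the union of everything any watcher asked for. All names and bit values below are hypothetical, and the real code also has to take locks; the point is only that an event nobody asked for returns before the list is walked.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative event bits, not the real NOTE_* values. */
#define MODEL_NOTE_DELETE	0x01u
#define MODEL_NOTE_WRITE	0x02u

#define MODEL_MAX_WATCHERS	4

struct model_watcher {
	unsigned mask;	/* events this watcher asked for */
	unsigned fired;	/* events delivered so far */
};

struct model_vnode {
	struct model_watcher w[MODEL_MAX_WATCHERS];
	size_t nwatchers;
	unsigned interest;	/* OR of all watcher masks */
	unsigned walks;		/* times the watcher list was traversed */
};

static void
model_vnode_init(struct model_vnode *vp)
{
	vp->nwatchers = 0;
	vp->interest = 0;
	vp->walks = 0;
}

/* Add a watcher and fold its mask into the vnode-wide interest set. */
static int
model_attach(struct model_vnode *vp, unsigned mask)
{
	if (vp->nwatchers == MODEL_MAX_WATCHERS)
		return -1;
	vp->w[vp->nwatchers].mask = mask;
	vp->w[vp->nwatchers].fired = 0;
	vp->nwatchers++;
	vp->interest |= mask;
	return 0;
}

/* Deliver an event, skipping all work if nobody is interested. */
static void
model_notify(struct model_vnode *vp, unsigned event)
{
	size_t i;

	if ((vp->interest & event) == 0)
		return;		/* cheap early exit: no list walk */
	vp->walks++;
	for (i = 0; i < vp->nwatchers; i++) {
		if (vp->w[i].mask & event)
			vp->w[i].fired |= event;
	}
}
```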

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
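
The three-step semantics can be sketched in user space as follows. This is a model under stated assumptions, not the kernel's vn_seek: `model_file`, `model_seek`, and the flag value are hypothetical, the caller is assumed to hold whatever lock protects `f_offset`, and the absolute case follows the usual lseek(2) meaning of SEEK_SET.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define MODEL_FOF_UPDATE_OFFSET 0x01	/* illustrative value only */

/* Hypothetical stand-in for the offset-bearing parts of an open file. */
struct model_file {
	int64_t f_offset;	/* current offset */
	int64_t f_size;		/* end of file, used for SEEK_END */
};

/* Returns 0 on success or an errno value. */
static int
model_seek(struct model_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* 1. Compute the new offset from whence + delta. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;	/* base + delta would overflow */
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* 2. Report the new offset if the caller wants it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* 3. Store it back only when asked to. */
	if (flags & MODEL_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Keeping the update optional lets one routine validate an offset for a pread-style call without disturbing the file's own position.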


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.
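
The calling convention described above (success yields either a vnode or a descriptor, and callers that cannot take a descriptor pass NULL) can be sketched like this. The function and type names, the argument order, and the descriptor number are invented for illustration; only the behaviour is taken from the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_vnode {
	int id;
};

static struct model_vnode model_regular_vnode = { 1 };

/* What the lower layer ends up producing for a given path. */
enum model_target {
	MODEL_TARGET_REGULAR,	/* ordinary vnode */
	MODEL_TARGET_DUPFD,	/* lower layer asked for EDUPFD */
	MODEL_TARGET_MOVEFD	/* lower layer asked for EMOVEFD */
};

/*
 * On success, either *ret_vp is a vnode, or *ret_vp is NULL and
 * *ret_fd/*ret_domove describe a file descriptor.
 */
static int
model_vn_open(enum model_target target, struct model_vnode **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	/* Catch API misuse up front. */
	assert(ret_vp != NULL);
	assert((ret_domove == NULL) == (ret_fd == NULL));

	if (target == MODEL_TARGET_REGULAR) {
		*ret_vp = &model_regular_vnode;
		return 0;	/* ret_domove and ret_fd left untouched */
	}

	/* A descriptor came back but the caller cannot accept one. */
	if (ret_fd == NULL)
		return EOPNOTSUPP;

	*ret_vp = NULL;
	*ret_domove = (target == MODEL_TARGET_MOVEFD);
	*ret_fd = 5;	/* arbitrary descriptor number for the sketch */
	return 0;
}
```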

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
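
As a rough userspace illustration of the accessor idea described above, the sketch below wraps a relaxed atomic load behind a function. The names demo_vnode and demo_vrefcnt and the field type are assumptions made for this example, not the kernel's definitions, and C11 <stdatomic.h> stands in for the kernel's atomic_load_relaxed().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vnode; only the counter is modelled. */
struct demo_vnode {
	atomic_uint v_usecount;
};

/*
 * Accessor in the spirit of vrefcnt(): callers no longer read the field
 * directly, and the load is explicitly relaxed (no ordering implied).
 */
static inline unsigned int
demo_vrefcnt(struct demo_vnode *vp)
{
	assert(vp != NULL);
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```

Hiding the field behind a function means a later change to how the count is stored only has to touch the accessor, not every caller.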


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
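
The truncation hazard described above can be reproduced in userspace with a small sketch. The helper below only models a min function defined on unsigned int; DEMO_MIN and the two clamp_len_* functions are hypothetical names for this example, and the demo assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) == 4, "demo assumes 32-bit unsigned int");

/* Model of a min helper that is defined on unsigned int only. */
static inline unsigned int
uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: no truncation, but each argument may be evaluated twice. */
#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamp a 64-bit length through the unsigned int helper (lossy). */
static inline uint64_t
clamp_len_truncating(uint64_t len, unsigned int limit)
{
	/* The cast makes visible the narrowing that a call would do silently. */
	return uimin((unsigned int)len, limit);
}

/* Clamp a 64-bit length with the comparison done in 64 bits (exact). */
static inline uint64_t
clamp_len_exact(uint64_t len, unsigned int limit)
{
	return DEMO_MIN(len, (uint64_t)limit);
}
```

With a length of 0x100000007 and a limit of 4096, the truncating version yields 7 because the upper 32 bits are discarded before the comparison, while the exact version yields 4096.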


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
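
A userspace sketch of how a caller might use the flag is given below. The function name open_regular_only is hypothetical, the #else branch is only a fallback for systems whose <fcntl.h> lacks O_REGULAR, and the -EINVAL returned by that fallback is an arbitrary choice for this example rather than what any kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path read-only, but only if it names a regular file.
 * Returns a descriptor on success or a negative errno value on failure.
 */
static int
open_regular_only(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The kernel refuses non-regular files as part of the open itself. */
	int fd = open(path, O_RDONLY | O_REGULAR);
	return fd >= 0 ? fd : -errno;
#else
	/*
	 * Fallback: open first, check afterwards. This is the pattern the
	 * commit message calls worse, because a device open may already
	 * have had side effects by the time fstat() runs.
	 */
	int fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;
	struct stat st;
	if (fstat(fd, &st) != 0) {
		int err = errno;
		close(fd);
		return -err;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		return -EINVAL;	/* arbitrary error code for this sketch */
	}
	return fd;
#endif
}
```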


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
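
The pattern behind this fix can be sketched in userspace as follows. The struct demo_stat and demo_stat_fill names are hypothetical stand-ins, with a field layout chosen only so that typical ABIs insert padding; this is not the kernel's struct stat or its fill routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record whose layout leaves padding on typical ABIs. */
struct demo_stat {
	uint8_t  ds_type;	/* usually followed by padding */
	uint64_t ds_dev;	/* 64-bit, like the widened dev_t */
	uint16_t ds_mode;	/* usually followed by tail padding */
};

/*
 * Fill the record for copying out. The memset() comes first so that
 * padding bytes hold zeros rather than whatever was in the buffer.
 */
static void
demo_stat_fill(struct demo_stat *sb, uint8_t type, uint64_t dev, uint16_t mode)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->ds_type = type;
	sb->ds_dev = dev;
	sb->ds_mode = mode;
}
```

Assigning the fields one by one without the memset() would leave the gaps between them untouched, so stale bytes from the buffer would travel along when the whole structure is copied to the caller.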


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(); 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this into two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
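
The four strict levels above nest: each level denies everything the previous one does, plus more. A minimal sketch of that policy as a lookup follows. The enum names, the operation list, and strict_denies() are hypothetical and chosen for illustration; they are not the kernel's identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels described in the log. */
enum strict_level {
	STRICT_LEARNING = 0,	/* warnings only */
	STRICT_IDS = 1,		/* deny on fingerprint mismatch */
	STRICT_IPS = 2,		/* also protect monitored files */
	STRICT_LOCKDOWN = 3	/* also deny anything unmonitored */
};

/* Hypothetical operations a policy check might be asked about. */
enum file_op {
	OP_ACCESS_MISMATCH,	/* access a file whose fingerprint mismatches */
	OP_MODIFY_TABLES,	/* modify the veriexec tables */
	OP_WRITE_MONITORED,	/* open a monitored file for writing */
	OP_REMOVE_MONITORED,	/* remove a monitored file */
	OP_CREATE_FILE,		/* create a new file */
	OP_ACCESS_UNMONITORED	/* access a file with no table entry */
};

/* Return true if `op` is denied at `level`; higher levels imply lower ones. */
static bool
strict_denies(int level, enum file_op op)
{
	assert(level >= STRICT_LEARNING && level <= STRICT_LOCKDOWN);

	switch (op) {
	case OP_ACCESS_MISMATCH:
	case OP_MODIFY_TABLES:
		return level >= STRICT_IDS;
	case OP_WRITE_MONITORED:
	case OP_REMOVE_MONITORED:
		return level >= STRICT_IPS;
	case OP_CREATE_FILE:
	case OP_ACCESS_UNMONITORED:
		return level >= STRICT_LOCKDOWN;
	}
	return false;
}
```

For example, strict_denies(STRICT_IDS, OP_WRITE_MONITORED) is false while strict_denies(STRICT_IPS, OP_WRITE_MONITORED) is true. Level 0 denies nothing here because it only warns.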


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
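
The relationship between the two ioctls can be modelled in a few lines. This is only a sketch of the semantics stated above, assuming a queue with a fixed capacity; struct sendq and both helpers are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a descriptor's send queue; not a kernel structure. */
struct sendq {
	size_t capacity;	/* total bytes the queue can hold */
	size_t queued;		/* bytes currently waiting to be sent */
};

/* What FIONWRITE is described as reporting: bytes in the send queue. */
static size_t
sendq_nwrite(const struct sendq *q)
{
	return q->queued;
}

/* What FIONSPACE is described as reporting: free bytes in the send queue. */
static size_t
sendq_nspace(const struct sendq *q)
{
	assert(q->queued <= q->capacity);
	return q->capacity - q->queued;
}
```

The two values always sum to the queue size, which is why a program that has FIONSPACE does not need to know the size itself. As the log notes, free space in the queue does not guarantee that a write will not block.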


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals.

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
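
A minimal sketch of the pass-through convention follows. The numeric values -3 and -4 come from the message above; the layered handlers, the command numbers, and the final conversion to ENOTTY at the top level are assumptions made for illustration, not code from this revision.

```c
#include <assert.h>
#include <errno.h>

/* Values from the commit message; MODEL_ marks them as local to this sketch. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

/* Hypothetical command numbers, one understood by each layer. */
#define CMD_LAYER_A	1
#define CMD_LAYER_B	2

/* Each layer handles its own command and passes everything else through. */
static int
layer_a_ioctl(int cmd)
{
	return cmd == CMD_LAYER_A ? 0 : MODEL_EPASSTHROUGH;
}

static int
layer_b_ioctl(int cmd)
{
	return cmd == CMD_LAYER_B ? 0 : MODEL_EPASSTHROUGH;
}

/* Try each layer in turn; only the top level reports "unhandled". */
static int
model_ioctl(int cmd)
{
	int error;

	error = layer_a_ioctl(cmd);
	if (error == MODEL_EPASSTHROUGH)
		error = layer_b_ioctl(cmd);
	if (error == MODEL_EPASSTHROUGH)
		error = ENOTTY;		/* assumed top-level conversion */
	assert(error != -1);		/* -1 is no longer a valid result */
	return error;
}
```

Because "unhandled" has its own value, it can no longer be confused with a request to restart the system call.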


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
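
The two rules above (the minimums may sum to at most 95%, and a page is not reused if that would push its type below its minimum) can be sketched as follows. struct ubc_mins, mins_valid() and may_reuse() are hypothetical helpers, and the integer arithmetic assumes page counts small enough that multiplying by 100 does not overflow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical holder for the three minimum percentages. */
struct ubc_mins {
	int anonmin;	/* vm.anonmin: anonymous data */
	int vnodemin;	/* vm.vnodemin: cached files */
	int vtextmin;	/* vm.vtextmin: executable image */
};

/* The minimums are percentages and may not sum to more than 95. */
static bool
mins_valid(const struct ubc_mins *m)
{
	if (m->anonmin < 0 || m->vnodemin < 0 || m->vtextmin < 0)
		return false;
	return m->anonmin + m->vnodemin + m->vtextmin <= 95;
}

/*
 * May one page of a given type be reused?  `inuse` is the number of
 * pages of that type, `pageable` the total pageable pages, and
 * `minpct` the minimum percentage reserved for the type.
 */
static bool
may_reuse(unsigned long inuse, unsigned long pageable, int minpct)
{
	assert(minpct >= 0 && minpct <= 95);
	if (inuse == 0)
		return false;
	/* Deny if the count after reuse would fall below the minimum. */
	return (inuse - 1) * 100 >= pageable * (unsigned long)minpct;
}
```

With 1000 pageable pages and a 10% minimum, a type holding 101 pages may give one up, but a type holding exactly 100 may not.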


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
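
A rough sketch of what passing the offset by reference buys: the same read routine can serve both an ordinary read, which advances the file's own position, and a positioned read, which leaves it alone. The flag name, struct memfile, and memfile_read() are hypothetical stand-ins, not the fileops interface itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag name and value for this sketch. */
#define SKETCH_UPDATE_OFFSET	0x01

/* Hypothetical stand-in for an open file backed by a memory buffer. */
struct memfile {
	const char *data;
	size_t size;
	size_t offset;	/* the file's own current position */
};

/*
 * Read up to `len` bytes starting at *offp.  The caller decides which
 * offset to pass and, via `flags`, whether it is advanced.
 */
static size_t
memfile_read(const struct memfile *mf, size_t *offp, char *buf, size_t len,
    int flags)
{
	size_t n;

	assert(offp != NULL);
	if (*offp >= mf->size)
		return 0;
	n = mf->size - *offp;
	if (n > len)
		n = len;
	memcpy(buf, mf->data + *offp, n);
	if (flags & SKETCH_UPDATE_OFFSET)
		*offp += n;
	return n;
}
```

Passing &mf.offset with the flag set behaves like read(2); passing a private offset with no flag behaves like a positioned read and does not disturb the file position.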


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(); it uses a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
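
The three-step semantics can be sketched in user-space C as below, using the usual lseek(2) meaning of whence. struct sketch_file, sketch_seek(), and the flag value are hypothetical; the object lock that the message requires around f_offset is deliberately left out, and the overflow check is an addition of this sketch rather than something stated in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical flag value standing in for FOF_UPDATE_OFFSET. */
#define SKETCH_FOF_UPDATE_OFFSET	0x01

/* Hypothetical stand-in for struct file plus the object's size. */
struct sketch_file {
	int64_t f_offset;	/* current position */
	int64_t f_size;		/* end of file */
};

static int
sketch_seek(struct sketch_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	assert(fp != NULL && fp->f_offset >= 0 && fp->f_size >= 0);

	/* Step 1: pick the base that delta is applied to. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}
	/* Test before adding, so the test itself cannot overflow. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;
	/* Step 3: commit it only when requested. */
	if (flags & SKETCH_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Calling it with flags 0 computes and validates an offset without moving the file position; passing the update flag with a null newoffp moves the position without reporting it.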


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.
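
The calling convention described above (either a vnode or a file
descriptor comes back, and callers that cannot handle a descriptor pass
NULL) might be modelled like this. It is a sketch only: model_open(),
struct model_result and struct model_vnode are hypothetical names, not
the real vn_open signature, and the "both or neither" assertion is an
assumption about what API misuse would look like.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vnode. */
struct model_vnode {
	int id;
};

/* Hypothetical description of what the lower layers produced. */
struct model_result {
	bool yields_fd;			/* cloner-style open: an fd, no vnode */
	bool domove;			/* EMOVEFD-like rather than EDUPFD-like */
	int fd;				/* valid only if yields_fd */
	struct model_vnode *vp;		/* valid only if !yields_fd */
};

static int
model_open(const struct model_result *r, struct model_vnode **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	/* Assumed misuse checks: a vnode slot is mandatory, and the two
	 * descriptor-related slots are passed together or not at all. */
	assert(ret_vp != NULL);
	assert((ret_domove == NULL) == (ret_fd == NULL));

	if (r->yields_fd) {
		/* Caller is not prepared to cope with file descriptors. */
		if (ret_fd == NULL)
			return EOPNOTSUPP;
		*ret_vp = NULL;
		*ret_domove = r->domove;
		*ret_fd = r->fd;
		return 0;
	}

	/* Regular case: only the vnode is meaningful; the descriptor
	 * slots are deliberately left untouched. */
	*ret_vp = r->vp;
	return 0;
}
```

The point of the shape is that the magic-errno handling lives in one
place, so a caller that passes NULL gets an ordinary error instead of
an internal errno leaking out.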

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one from an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)
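
Why NULL and a -1 sentinel are not interchangeable can be shown with a
small user-space model. Everything here is hypothetical (struct
model_cred, MODEL_NOCRED, model_classify()); only the value -1 for the
sentinel is taken from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a credential object. */
struct model_cred {
	int uid;
};

/* Sentinel meaning "no credential", modelled as -1 cast to a pointer. */
#define MODEL_NOCRED	((struct model_cred *)(intptr_t)-1)

enum model_who {
	MODEL_KERNEL,	/* operation performed on behalf of the kernel */
	MODEL_USER	/* operation performed with a real credential */
};

static enum model_who
model_classify(const struct model_cred *cred)
{
	/* With a -1 sentinel, NULL is not "no credential"; code that
	 * went on to dereference it would crash, so reject it here. */
	assert(cred != NULL);

	if (cred == MODEL_NOCRED)
		return MODEL_KERNEL;
	return MODEL_USER;	/* real code would now read cred->uid */
}
```

Where the sentinel is 0, passing NULL happens to select the sentinel
branch, which is why the same source could work on one system and not
on another.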


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.
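
The difference between the two kinds of min can be seen in a small
user-space sketch. model_uimin() and MODEL_MIN() are hypothetical
copies written for illustration, not the libkern definitions, and the
sketch assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Function form: defined on unsigned int, like the renamed helper. */
static unsigned int
model_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: type-generic, but evaluates an argument twice. */
#define MODEL_MIN(a, b)	((a) < (b) ? (a) : (b))

static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	assert(sizeof(unsigned int) == 4);	/* assumption of this sketch */

	/* The casts are spelled out here; at an ordinary call site the
	 * same conversion happens implicitly and silently. */
	return model_uimin((unsigned int)len, (unsigned int)limit);
}

static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	/* Comparison is done in uint64_t, so no bits are lost. */
	return MODEL_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 0x200000000, the function form sees
1 and 0 after truncation and returns 0, while the macro form returns
0x100000001.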

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.
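
The pattern is to clear the whole structure before filling in members,
so that padding between members never carries stale memory. A minimal
sketch, with a hypothetical struct model_stat standing in for the real
stat structure; whether padding exists depends on the ABI, and the C
standard leaves padding bytes unspecified after member stores, although
common compilers leave them alone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-member structure with a likely padding hole. */
struct model_stat {
	uint32_t st_mode;	/* 4 bytes; padding may follow */
	uint64_t st_dev;	/* 64-bit, so usually 8-byte aligned */
};

static void
model_fill_stat(struct model_stat *sb, uint32_t mode, uint64_t dev)
{
	assert(sb != NULL);

	/* Zero everything first, including any padding bytes. */
	memset(sb, 0, sizeof(*sb));

	/* Then fill in the members individually. */
	sb->st_mode = mode;
	sb->st_dev = dev;
}
```

Without the memset, whatever was previously in the padding bytes would
be copied out together with the members.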

from pgoyette


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
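
The four strict levels above are cumulative, so each check reduces to a
comparison against a threshold. A minimal sketch of that idea follows; the
enum and function names are hypothetical and are not taken from the
Veriexec sources.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels described above. */
enum sk_strict {
	SK_LEARNING = 0,	/* warnings only */
	SK_IDS = 1,		/* deny on fingerprint mismatch */
	SK_IPS = 2,		/* also protect monitored files */
	SK_LOCKDOWN = 3		/* also deny non-monitored files */
};

/* Level 1 and above: a fingerprint mismatch denies access. */
static bool
sk_deny_on_mismatch(enum sk_strict level)
{
	return level >= SK_IDS;
}

/* Level 2 and above: monitored files may not be written or removed. */
static bool
sk_deny_write_monitored(enum sk_strict level)
{
	return level >= SK_IPS;
}

/* Level 3 only: non-monitored files are inaccessible, no new files. */
static bool
sk_deny_unmonitored(enum sk_strict level)
{
	return level >= SK_LOCKDOWN;
}
```

Because every level implies the ones below it, no per-level table is
needed; at level 0 all three predicates are false and only warnings would
be issued.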


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
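
The convention described above can be sketched as a chain of handlers,
each returning a distinct "not mine" code for commands it does not
recognise. Everything below is a hypothetical model: the SK_ names, the
command numbers and the two handlers are invented for illustration, and the
final mapping of an unclaimed command to ENOTTY is an assumption about what
the top level does.

```c
#include <errno.h>

/* Hypothetical stand-in for EPASSTHROUGH (-4 per the message above). */
#define SK_EPASSTHROUGH	(-4)

/* Hypothetical command numbers for the sketch. */
#define SK_CMD_GETVAL	1
#define SK_CMD_RESET	2

/* Lower layer: handles one command, passes everything else through. */
static int
sk_driver_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_GETVAL:
		*data = 42;
		return 0;
	default:
		return SK_EPASSTHROUGH;	/* not handled here */
	}
}

/* Generic layer: handles a different command. */
static int
sk_generic_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_RESET:
		*data = 0;
		return 0;
	default:
		return SK_EPASSTHROUGH;
	}
}

/* Top level: try each layer, then turn "nobody handled it" into ENOTTY. */
static int
sk_ioctl(int cmd, int *data)
{
	int error;

	error = sk_driver_ioctl(cmd, data);
	if (error == SK_EPASSTHROUGH)
		error = sk_generic_ioctl(cmd, data);
	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Keeping the pass-through code distinct from -1 means a handler's "not
mine" can no longer be mistaken for a restart request.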


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
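
The 95% cap on the sum of the three minimums can be sketched as a
validation step run before a new value is accepted. This is only a model:
the SK_ names and the function are hypothetical, and returning EINVAL for a
rejected value is an assumption rather than something stated above.

```c
#include <errno.h>

/* The three usage types tracked above (hypothetical identifiers). */
enum sk_kind { SK_ANON, SK_FILE, SK_EXEC, SK_NKINDS };

/* The minimums together may reserve at most this share of memory. */
#define SK_MAX_TOTAL	95

/*
 * Try to set one minimum (a percentage).  The update is applied only
 * if the value is a valid percentage and the sum of all minimums,
 * with the new value in place, does not exceed SK_MAX_TOTAL.
 */
static int
sk_set_min(int mins[SK_NKINDS], enum sk_kind which, int pct)
{
	int i, sum = pct;

	if (pct < 0 || pct > 100)
		return EINVAL;
	for (i = 0; i < SK_NKINDS; i++) {
		if (i != (int)which)
			sum += mins[i];
	}
	if (sum > SK_MAX_TOTAL)
		return EINVAL;
	mins[which] = pct;
	return 0;
}
```

Rejecting the update outright, rather than clamping it, leaves the
previous thresholds intact when a request would reserve too much.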


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.233 06-Jul-2022 riastradh

kern: Work around spurious -Wtype-limits warnings.

This useless garbage warning is apparently designed to make it
painful to write portable safe arithmetic and I think we ought to
just disable it.


# 1.232 06-Jul-2022 riastradh

kern/vfs_vnops.c: Fix missing semicolon in previous.

Neglected to build and amend commit, oops.


# 1.231 06-Jul-2022 riastradh

kern/vfs_vnops.c: Sprinkle KNF.

No functional change intended.


# 1.230 06-Jul-2022 riastradh

mmap(2): Avoid overflow in overflow check in vn_mmap.


# 1.229 06-Jul-2022 riastradh

uvm(9): fo_mmap caller guarantees positive size.

No functional change intended, just sprinkling assertions to make it
clearer.


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
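
The three-step semantics described in 1.222 can be sketched in plain
userland C. This is only an illustration under stated assumptions, not
the kernel code: struct sketch_file, its f_size member, the
SKETCH_UPDATE_OFFSET flag (standing in for FOF_UPDATE_OFFSET) and the
choice of EINVAL/EOVERFLOW as error values are all hypothetical, and the
object lock that the log requires around f_offset access is omitted.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END, NULL */

/* Hypothetical stand-in for struct file; not the real kernel layout. */
struct sketch_file {
	int64_t f_offset;	/* current offset */
	int64_t f_size;		/* end of file, used for SEEK_END */
};

#define SKETCH_UPDATE_OFFSET 0x1	/* stands in for FOF_UPDATE_OFFSET */

/*
 * Returns 0 on success or an errno value.  A real implementation would
 * hold the object lock around every access to fp->f_offset.
 */
int
sketch_seek(struct sketch_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* Step 1: pick the base selected by whence. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}

	/* Check for overflow before adding, never after. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	if (delta < 0 && base < INT64_MIN - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* Step 3: commit the new offset only when requested. */
	if (flags & SKETCH_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Calling sketch_seek() with flags == 0 computes an offset without moving
the file position, which is the pattern a pread-style caller would want;
passing SKETCH_UPDATE_OFFSET gives lseek-style behavior.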


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
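
The truncation hazard that motivated the rename in 1.198 can be
reproduced in a few lines. This is a minimal sketch, assuming a platform
where unsigned int is 32 bits wide; sketch_uimin, SKETCH_MIN and the two
clamp functions are hypothetical names used only for illustration and are
not the libkern definitions.

```c
#include <stdint.h>

/* Same shape as the helper described above: both operands are unsigned int. */
static unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro; note that it evaluates each argument twice. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	/*
	 * The casts spell out the conversion the compiler would otherwise
	 * perform silently: the high 32 bits of each operand are discarded.
	 */
	return sketch_uimin((unsigned int)len, (unsigned int)limit);
}

uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	/* The comparison happens in uint64_t, so nothing is lost. */
	return SKETCH_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 0x200000000, the function-based clamp
compares the truncated values 1 and 0 and returns 0, while the macro
returns len unchanged.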


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
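
The four strict levels listed above can be read as a cumulative policy. The following is a minimal sketch of that reading for the open path only, not the veriexec code itself: the enum names, struct demo_file, and demo_may_open() are hypothetical, and the access-type, removal, and table-modification rules are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the four strict levels from the list above. */
enum demo_strict {
	DEMO_STRICT_LEARNING = 0,	/* warnings only */
	DEMO_STRICT_IDS = 1,		/* deny on fingerprint mismatch */
	DEMO_STRICT_IPS = 2,		/* also deny writes to monitored files */
	DEMO_STRICT_LOCKDOWN = 3	/* also deny non-monitored files */
};

/* Hypothetical per-file state. */
struct demo_file {
	bool monitored;		/* file has an entry in the table */
	bool fp_match;		/* fingerprint matched on evaluation */
};

/* Return true if opening the file is allowed at the given level. */
static bool
demo_may_open(enum demo_strict level, const struct demo_file *f,
    bool for_write)
{
	assert(f != NULL);

	if (!f->monitored)
		return level < DEMO_STRICT_LOCKDOWN;
	if (level >= DEMO_STRICT_IDS && !f->fp_match)
		return false;
	if (level >= DEMO_STRICT_IPS && for_write)
		return false;
	return true;
}
```

Each level only adds restrictions on top of the previous one, which is why the checks compare with >= rather than ==.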


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
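
A sketch of how a pass-through code separate from ERESTART can be used when several layers get a chance to handle an ioctl command. The values -3 and -4 come from the message above. Everything else is assumed for illustration: the DEMO_ names, the two toy handlers, and the final conversion of an unhandled command to ENOTTY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values as stated in the commit message; names are hypothetical. */
#define DEMO_ERESTART		(-3)
#define DEMO_EPASSTHROUGH	(-4)

typedef int (*demo_ioctl_fn)(unsigned long cmd, void *data);

/* Toy handler that only understands command 1. */
static int
demo_tty_ioctl(unsigned long cmd, void *data)
{
	(void)data;
	return cmd == 1 ? 0 : DEMO_EPASSTHROUGH;
}

/* Toy handler that only understands command 2. */
static int
demo_dev_ioctl(unsigned long cmd, void *data)
{
	(void)data;
	return cmd == 2 ? 0 : DEMO_EPASSTHROUGH;
}

/*
 * Offer the command to each layer in turn.  A layer that does not
 * recognise the command returns DEMO_EPASSTHROUGH, so any other value
 * (including DEMO_ERESTART) is a real result and is returned as is.
 */
static int
demo_ioctl(const demo_ioctl_fn *layers, size_t nlayers, unsigned long cmd,
    void *data)
{
	assert(layers != NULL || nlayers == 0);

	for (size_t i = 0; i < nlayers; i++) {
		int error = layers[i](cmd, data);

		if (error != DEMO_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;	/* nobody handled the command */
}
```

Because the pass-through code no longer shares the value -1 with a restart request, the dispatcher can tell "not my command" apart from "restart the system call".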


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
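
The two rules in this message (the minimums may not sum to more than 95%, and a page may not be reused if that would push its type below the minimum) can be sketched as follows. This is an illustration, not the UVM code: struct demo_vm_mins, demo_check_mins(), and demo_may_reuse() are hypothetical, and only the field names mirror the sysctl tunables named above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_MIN_SUM	95	/* limit stated in the commit message */

/* Percentages of pageable memory reserved for each usage type. */
struct demo_vm_mins {
	int anonmin;
	int vnodemin;
	int vtextmin;
};

/* Reject settings whose minimums are out of range or sum above 95%. */
static int
demo_check_mins(const struct demo_vm_mins *m)
{
	assert(m != NULL);

	if (m->anonmin < 0 || m->anonmin > 100 ||
	    m->vnodemin < 0 || m->vnodemin > 100 ||
	    m->vtextmin < 0 || m->vtextmin > 100)
		return EINVAL;
	if (m->anonmin + m->vnodemin + m->vtextmin > DEMO_MAX_MIN_SUM)
		return EINVAL;
	return 0;
}

/*
 * Return true if taking one page away from a usage type still leaves
 * that type at or above min_pct percent of the pageable pages.
 */
static bool
demo_may_reuse(uint64_t pages_of_type, uint64_t pageable_pages, int min_pct)
{
	assert(min_pct >= 0 && min_pct <= 100);

	if (pages_of_type == 0)
		return false;
	return (pages_of_type - 1) * 100 >= pageable_pages * (uint64_t)min_pct;
}
```

For example, with 1000 pageable pages and a 10% minimum, a type holding 101 pages may give one up, but a type holding exactly 100 may not.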


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.228 22-May-2022 andvar

fix various small typos, mainly in comments.


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(); it now uses a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
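
The three-step semantics listed above might look roughly like this in isolation. It is a userspace sketch under stated assumptions, not the kernel's vn_seek: the types, the SK_* names and the error choices are hypothetical, the "use delta as is" case is treated as the absolute one, and the locking described above is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SEEK_SET/SEEK_CUR/SEEK_END and FOF_UPDATE_OFFSET. */
enum sketch_whence { SK_SET, SK_CUR, SK_END };
#define SK_UPDATE_OFFSET	0x1

/* Hypothetical file object; the real struct file is far larger. */
struct sketch_file {
	int64_t f_offset;	/* current offset, assumed >= 0 */
	int64_t f_size;		/* end of file, assumed >= 0 */
};

static int
sketch_seek(struct sketch_file *fp, int64_t delta, enum sketch_whence whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	assert(fp != NULL);

	/* Step 1: pick the base that delta is relative to. */
	switch (whence) {
	case SK_SET: base = 0; break;
	case SK_CUR: base = fp->f_offset; break;
	case SK_END: base = fp->f_size; break;
	default: return EINVAL;
	}

	/* base >= 0, so only a positive delta can overflow the sum. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller wants it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* Step 3: commit it only when asked to. */
	if (flags & SK_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Calling it with flags of 0 computes an offset without moving the file position, which is the shape a pread-style caller would want.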


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
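
To make the truncation concern concrete, here is a small sketch. It assumes a 32-bit unsigned int; sketch_uimin and SKETCH_MIN are hypothetical stand-ins written for this illustration, not the libkern definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() defined on unsigned int, as described above. */
static inline unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: no truncation, but each argument is evaluated twice. */
#define SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Passing 64-bit values to the unsigned int helper.  The casts are
 * written out here; without them the compiler performs the same
 * narrowing conversion silently.
 */
static uint64_t
clamp_truncating(uint64_t len, uint64_t limit)
{
	return sketch_uimin((unsigned int)len, (unsigned int)limit);
}

/* The macro compares at full width. */
static uint64_t
clamp_exact(uint64_t len, uint64_t limit)
{
	return SKETCH_MIN(len, limit);
}
```

For a length of 2^32 + 5 and a limit of 4096, the first helper compares 5 against 4096 because the high bits were dropped, while the second compares the full values.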


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
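
A minimal sketch of the pattern described above, assuming a hypothetical two-field structure (struct sketch_stat is not the real struct stat). On common 64-bit ABIs the 64-bit member is 8-byte aligned, so padding follows the 32-bit member; zeroing the whole structure first keeps stale bytes out of that gap. Strictly speaking the C standard leaves padding contents unspecified after a member store, so this relies on the usual compiler behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure with a likely padding hole after st_mode. */
struct sketch_stat {
	uint32_t st_mode;
	uint64_t st_dev;	/* 64-bit, like the widened dev_t */
};

/* Zero everything, padding included, before filling in the fields. */
static void
sketch_fill_stat(struct sketch_stat *sb, uint32_t mode, uint64_t dev)
{
	memset(sb, 0, sizeof(*sb));
	sb->st_mode = mode;
	sb->st_dev = dev;
}

/* Return 1 if every byte between st_mode and st_dev is zero. */
static int
sketch_padding_is_zero(const struct sketch_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	size_t i;

	for (i = sizeof(sb->st_mode); i < offsetof(struct sketch_stat, st_dev); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

If the memset is dropped, whatever was previously in that memory can remain in the gap and be copied out along with the real fields.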


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
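
The lookup order described above can be sketched in userland C. This is only an illustration under stated assumptions, not the namei() code: lookup_tryemulroot() and demo_exists() are hypothetical names, and path existence is faked with a fixed table.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a real lookup: reports whether a path exists. */
static bool
demo_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}

/*
 * Resolve `path` for an emulated process (hypothetical helper).
 * Absolute pathnames are first tried under `emul_root`; if that fails,
 * the lookup is retried from the normal root.
 * Returns 0 and fills `out` on success, -1 if nothing was found.
 */
static int
lookup_tryemulroot(const char *path, const char *emul_root,
    char *out, size_t outlen)
{
	/* Absolute pathnames get a first attempt inside the emulation root. */
	if (path[0] == '/' && emul_root != NULL) {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n > 0 && (size_t)n < outlen && demo_exists(out))
			return 0;
	}
	/* Otherwise, or on failure, retry from the normal root. */
	if (strlen(path) >= outlen)
		return -1;
	strcpy(out, path);
	return demo_exists(out) ? 0 : -1;
}
```

With an emulation root of /emul/linux, /lib/libc.so resolves inside the emulation root, while /etc/passwd falls back to the real root. The ".." and symlink handling described above is left out of the sketch.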


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
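
The four strict levels above can be read as a cumulative policy. A minimal sketch of the open-time decision, assuming hypothetical names (strict_level, file_state, deny_open) that are not the kernel's identifiers:

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels listed above. */
enum strict_level {
	STRICT_LEARNING = 0,	/* warnings only */
	STRICT_IDS = 1,		/* deny on fingerprint mismatch */
	STRICT_IPS = 2,		/* also deny writes to monitored files */
	STRICT_LOCKDOWN = 3	/* also deny access to non-monitored files */
};

/* Hypothetical per-file state; the real table entry holds more. */
struct file_state {
	bool monitored;		/* file has a fingerprint entry */
	bool fp_matches;	/* fingerprint evaluated and matched */
};

/* Returns true if the open should be denied at the given level. */
static bool
deny_open(enum strict_level level, const struct file_state *f, bool for_write)
{
	if (level == STRICT_LEARNING)
		return false;		/* learning mode only warns */
	if (f->monitored && !f->fp_matches)
		return true;		/* level 1 and up */
	if (level >= STRICT_IPS && f->monitored && for_write)
		return true;		/* level 2 and up */
	if (level >= STRICT_LOCKDOWN && !f->monitored)
		return true;		/* level 3 */
	return false;
}
```

Each level inherits the checks of the one below it, which is why the comparisons use >= rather than ==. Removal checks, access-type enforcement and file creation are left out of the sketch.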


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
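
The convention can be sketched as follows. This is an illustration only: the SK_ constants mirror the values given above but are local to the sketch, SK_CMD_GET_BLKSIZE is a made-up command, and translating the pass-through code to ENOTTY at the top level is an assumption about how callers would treat an unhandled command.

```c
#include <errno.h>

/* Values as stated above; SK_ prefix marks them as local to this sketch. */
#define SK_ERESTART	(-3)
#define SK_EPASSTHROUGH	(-4)

/* Hypothetical command number, for illustration only. */
#define SK_CMD_GET_BLKSIZE	1UL

/* A lower layer handles what it knows and passes everything else through. */
static int
lower_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_GET_BLKSIZE:
		*data = 512;
		return 0;
	default:
		return SK_EPASSTHROUGH;	/* not -1, which used to collide with ERESTART */
	}
}

/* The top level turns "nobody handled it" into a user-visible errno. */
static int
top_ioctl(unsigned long cmd, int *data)
{
	int error = lower_ioctl(cmd, data);

	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through code is distinct from the restart code, a caller can tell an unhandled command apart from a request to restart the system call.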


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
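
The 95% rule can be sketched as a validation step when one of the minimums is changed. The names below (usage, min_thresholds, set_min_threshold) are hypothetical and the sketch does not reflect how the sysctl handlers are actually written:

```c
#include <errno.h>

/* The three usage types tracked above; names are hypothetical. */
enum usage { USAGE_ANON, USAGE_FILE, USAGE_EXEC, USAGE_COUNT };

/* Minimum percentage of pageable memory reserved for each usage type. */
struct min_thresholds {
	int pct[USAGE_COUNT];
};

/*
 * Set one minimum, rejecting the change if the sum of all minimums
 * would exceed 95%, so that some memory always remains reusable.
 * Returns 0 on success, EINVAL if the new value is not acceptable.
 */
static int
set_min_threshold(struct min_thresholds *t, enum usage which, int pct)
{
	int i, sum = 0;

	if (pct < 0 || pct > 100)
		return EINVAL;
	for (i = 0; i < USAGE_COUNT; i++)
		sum += (i == (int)which) ? pct : t->pct[i];
	if (sum > 95)
		return EINVAL;
	t->pct[which] = pct;
	return 0;
}
```

A rejected update leaves the previous thresholds untouched, so the invariant holds after every call.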


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
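
The two rules above amount to a mapping from per-file flags to the ioflag handed to the vnode operation. A rough sketch, assuming made-up bit values and SK_-prefixed names; the exact flag tests, and the file-integrity case on the write side, are assumptions rather than a copy of vn_read()/vn_write():

```c
/* Hypothetical per-file flags (stand-ins for the open(2) sync requests). */
#define SK_FFSYNC	0x01	/* file integrity completion requested */
#define SK_FDSYNC	0x02	/* data integrity completion requested */
#define SK_FRSYNC	0x04	/* apply the request to reads as well */

/* Hypothetical ioflag bits passed to the read/write vnode operators. */
#define SK_IO_SYNC	0x10
#define SK_IO_DSYNC	0x20

/* Reads are synchronized only when both file integrity and rsync are set. */
static int
read_ioflag(int fflag)
{
	int ioflag = 0;

	if ((fflag & (SK_FFSYNC | SK_FRSYNC)) == (SK_FFSYNC | SK_FRSYNC))
		ioflag |= SK_IO_SYNC;
	return ioflag;
}

/* Writes prefer full file integrity and fall back to data integrity. */
static int
write_ioflag(int fflag)
{
	int ioflag = 0;

	if (fflag & SK_FFSYNC)
		ioflag |= SK_IO_SYNC;
	else if (fflag & SK_FDSYNC)
		ioflag |= SK_IO_DSYNC;
	return ioflag;
}
```

Keeping the translation in one place per direction means the file systems only ever see IO_SYNC or IO_DSYNC and never the open(2) flags themselves.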


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
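
The effect of the two new arguments can be sketched with an in-memory file. Everything here is hypothetical (memfile, memfile_read, SK_UPDATE_OFFSET); it only illustrates an offset passed by reference plus a flag that decides whether it is advanced:

```c
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: advance the caller's offset after the transfer. */
#define SK_UPDATE_OFFSET	0x01

/* Hypothetical in-memory file used in place of a real struct file. */
struct memfile {
	const char *data;
	size_t size;
};

/*
 * Read up to `len` bytes starting at `*offset`.  The offset is only
 * written back when SK_UPDATE_OFFSET is set, so one routine can serve
 * both read(2)-style and pread(2)-style callers.
 * Returns the number of bytes copied, 0 at end of file, -1 on a bad offset.
 */
static ssize_t
memfile_read(const struct memfile *mf, off_t *offset, void *buf,
    size_t len, int flags)
{
	size_t avail, n;

	if (*offset < 0)
		return -1;
	if ((size_t)*offset >= mf->size)
		return 0;
	avail = mf->size - (size_t)*offset;
	n = len < avail ? len : avail;
	memcpy(buf, mf->data + *offset, n);
	if (flags & SK_UPDATE_OFFSET)
		*offset += (off_t)n;
	return (ssize_t)n;
}
```

A sequential reader passes a pointer to the file's own offset with the flag set; a positioned read passes a pointer to a local copy and leaves the flag clear.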


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.
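
A minimal sketch of the flag check described above, not the code in vn_write(). The MODEL_* constants and model_write_ioflag() are hypothetical stand-ins for O_FSYNC, MNT_SYNCHRONOUS, IO_SYNC and the kernel logic; their values are made up for illustration.

```c
/* Hypothetical stand-ins; the real flag values live in kernel headers. */
#define MODEL_O_FSYNC		0x01	/* per-open flag */
#define MODEL_MNT_SYNCHRONOUS	0x02	/* per-mount flag */
#define MODEL_IO_SYNC		0x04	/* ioflag passed to the write op */

/*
 * Compute the ioflag for a write: request synchronous I/O if either
 * the open flags or the mount flags ask for it.
 */
static int
model_write_ioflag(int fflag, int mntflag)
{
	int ioflag = 0;

	if ((fflag & MODEL_O_FSYNC) != 0 ||
	    (mntflag & MODEL_MNT_SYNCHRONOUS) != 0)
		ioflag |= MODEL_IO_SYNC;
	return ioflag;
}
```

The point of the sketch is only that either source alone is enough to set the synchronous flag.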


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.227 25-Mar-2022 hannken

It is impossible for VOP_LOCK() to return ENOENT with LK_RETRY flag.
Remove the second call to VOP_LOCK().

Enable assertion "vrefcnt(vp) > 0" and assert all possible errors
for all LK_RETRY/LK_NOWAIT combinations.


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com
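
A userland sketch of the kind of check such a fix implies, assuming 64-bit signed offsets. checked_off_add() is a hypothetical helper, not the code in vn_seek(); it tests for overflow before adding rather than after, because signed overflow is undefined behaviour in C.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: store base + delta in *resultp, or return
 * EOVERFLOW without touching *resultp if the sum does not fit.
 */
static int
checked_off_add(int64_t base, int64_t delta, int64_t *resultp)
{
	/* Positive delta: would base + delta exceed INT64_MAX? */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	/* Negative delta: would base + delta drop below INT64_MIN? */
	if (delta < 0 && base < INT64_MIN - delta)
		return EOVERFLOW;
	*resultp = base + delta;
	return 0;
}
```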


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.,
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
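
The three-step semantics above can be modelled in userland roughly as follows. This is a sketch, not a real fo_seek implementation: struct model_file, model_seek() and MODEL_FOF_UPDATE_OFFSET are hypothetical stand-ins, the absolute case is taken to be SEEK_SET, rejecting negative results with EINVAL is an assumption, and the object lock that must cover f_offset is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define MODEL_FOF_UPDATE_OFFSET	0x01	/* stand-in for FOF_UPDATE_OFFSET */

/* Hypothetical model of the two fields the seek logic needs. */
struct model_file {
	int64_t f_offset;	/* current offset */
	int64_t f_size;		/* end of file */
};

static int
model_seek(struct model_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* Step 1: pick the base that delta is relative to. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}
	if ((delta > 0 && base > INT64_MAX - delta) ||
	    (delta < 0 && base < INT64_MIN - delta))
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)		/* assumption: negative offsets rejected */
		return EINVAL;

	/* Step 2: report the new offset if the caller wants it. */
	if (newoffp != NULL)
		*newoffp = newoff;
	/* Step 3: commit it to the file only when asked to. */
	if ((flags & MODEL_FOF_UPDATE_OFFSET) != 0)
		fp->f_offset = newoff;
	return 0;
}
```

In this model a pread-style caller would pass flags of 0 and read the result from *newoffp, while an lseek-style caller would pass the update flag.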


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
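
A userland sketch of the accessor pattern, written with C11 atomics rather than the kernel's atomic_load_relaxed(). struct model_vnode and model_vrefcnt() are hypothetical; the real struct vnode and the exact return type of vrefcnt() are not shown in this log.

```c
#include <stdatomic.h>

/* Hypothetical cut-down vnode holding only a use count. */
struct model_vnode {
	atomic_uint v_usecount;
};

/*
 * Hide the field behind a function so callers do not depend on how
 * the count is stored; a relaxed load gives an untorn snapshot with
 * no ordering guarantees.
 */
static unsigned int
model_vrefcnt(struct model_vnode *vp)
{
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```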


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
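
A small sketch of the truncation hazard this entry describes, assuming a platform where unsigned int is 32 bits. uimin_model(), MIN_MODEL and the two clamp functions are hypothetical names for illustration; the explicit casts spell out the conversion the compiler would otherwise apply silently at the call.

```c
#include <stdint.h>

/* Function form: defined on unsigned int, like the renamed uimin(). */
static unsigned int
uimin_model(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: may evaluate arguments twice, but keeps their width. */
#define MIN_MODEL(a, b)	((a) < (b) ? (a) : (b))

/* 64-bit arguments are narrowed to unsigned int before comparing. */
static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	return uimin_model((unsigned int)len, (unsigned int)limit);
}

/* The comparison happens at full 64-bit width. */
static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	return MIN_MODEL(len, limit);
}
```

With len = 2^32 + 1 and limit = 2^33, the function form compares the truncated values 1 and 0, while the macro form compares the full values.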


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
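
A userland sketch of the pattern, assuming a made-up structure layout. struct model_stat, model_fill_stat() and model_gap_is_zero() are hypothetical; the real struct stat differs. Zeroing the whole structure first means any padding between members holds zeros rather than whatever was previously in that memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 32-bit member before a 64-bit one usually
 * leaves a padding gap on 64-bit ABIs. */
struct model_stat {
	uint32_t ms_mode;
	uint64_t ms_dev;	/* 64-bit, like the widened dev_t */
	uint16_t ms_nlink;
};

static void
model_fill_stat(struct model_stat *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
	/* Clear everything, padding included, before filling members. */
	memset(sb, 0, sizeof(*sb));
	sb->ms_mode = mode;
	sb->ms_dev = dev;
	sb->ms_nlink = nlink;
}

/* Check the bytes (if any) between ms_mode and ms_dev. */
static int
model_gap_is_zero(const struct model_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	size_t i;

	for (i = sizeof(sb->ms_mode);
	    i < offsetof(struct model_stat, ms_dev); i++) {
		if (p[i] != 0)
			return 0;
	}
	return 1;
}
```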


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
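
A userland sketch of the intended behaviour, not the actual change to vn_write(). struct append_file and append_write() are hypothetical: an append-mode write normally moves the offset to end of file first, and the sketch skips that when nothing is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an open file in append mode. */
struct append_file {
	int64_t af_offset;	/* current offset */
	int64_t af_size;	/* current file size */
	int af_append;		/* nonzero if opened with O_APPEND */
};

static void
append_write(struct append_file *fp, size_t len)
{
	/* A zero-length write changes nothing, including the offset. */
	if (len == 0)
		return;
	/* Append mode: every non-empty write starts at end of file. */
	if (fp->af_append)
		fp->af_offset = fp->af_size;
	fp->af_offset += (int64_t)len;
	if (fp->af_offset > fp->af_size)
		fp->af_size = fp->af_offset;
}
```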


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
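
The lookup order described above can be sketched roughly as follows. This is
a userland illustration only, not the namei() code: lookup_with_emulroot(),
the exists callback and demo_exists() are hypothetical names introduced for
the example, and the ".." and symlink handling is not modelled.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate: returns nonzero if `path` resolves. */
typedef int (*exists_fn)(const char *path);

/* Toy predicate for the example: only these two paths "exist". */
int
demo_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}

/*
 * Absolute paths are first tried under the emulation root and then
 * retried from the real root. Relative paths are never redirected.
 * Writes the chosen path to `out` and returns 0, or returns -1 if
 * nothing resolves or `out` is too small.
 */
int
lookup_with_emulroot(const char *emulroot, const char *path,
    exists_fn exists, char *out, size_t outlen)
{
	if (emulroot != NULL && path[0] == '/') {
		int n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n >= 0 && (size_t)n < outlen && exists(out))
			return 0;
	}
	/* Fall back to the normal root. */
	if (strlen(path) >= outlen || !exists(path))
		return -1;
	strcpy(out, path);
	return 0;
}
```

With an emulation root of "/emul/linux", a lookup of "/lib/libc.so" resolves
inside the emulation root, while "/etc/passwd" falls back to the real root.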


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
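
The four strict levels listed above can be read as a set of threshold checks.
The following is a minimal sketch of that reading, not the Veriexec code: the
enum and function names are illustrative and the kernel's own identifiers may
differ.

```c
#include <stdbool.h>

/* Illustrative names; the numeric values follow the list above. */
enum strict_level {
	LEVEL_LEARNING = 0,	/* warnings only */
	LEVEL_IDS = 1,
	LEVEL_IPS = 2,
	LEVEL_LOCKDOWN = 3
};

/* A fingerprint mismatch only produces a warning in learning mode. */
bool
deny_on_mismatch(enum strict_level lvl)
{
	return lvl >= LEVEL_IDS;
}

/* Write access to monitored files is denied from IPS mode upward. */
bool
may_write_monitored(enum strict_level lvl)
{
	return lvl < LEVEL_IPS;
}

/* Creating new files is prevented only in lockdown mode. */
bool
may_create_file(enum strict_level lvl)
{
	return lvl < LEVEL_LOCKDOWN;
}

/* Access to non-monitored files is denied only in lockdown mode. */
bool
may_access_unmonitored(enum strict_level lvl)
{
	return lvl < LEVEL_LOCKDOWN;
}
```

Because each level includes all implications of the level below it, every
check reduces to a single comparison against the level that introduces it.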


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
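
From userland, the two ioctls might be used along these lines. This is a
sketch under assumptions: sendq_status() is a hypothetical helper, both
ioctls are assumed to take a pointer to int, and the #if guard is only there
so that the example still compiles on systems that lack them.

```c
#include <errno.h>
#include <sys/ioctl.h>

/*
 * Query the send queue of `fd`. On success returns 0 and stores the
 * number of queued bytes in *queued and the free space in *space.
 * On failure returns -1 with errno set.
 */
int
sendq_status(int fd, int *queued, int *space)
{
#if defined(FIONWRITE) && defined(FIONSPACE)
	if (ioctl(fd, FIONWRITE, queued) == -1)
		return -1;
	if (ioctl(fd, FIONSPACE, space) == -1)
		return -1;
	return 0;
#else
	/* The ioctls are not available on this system. */
	(void)fd;
	(void)queued;
	(void)space;
	errno = ENOTSUP;
	return -1;
#endif
}
```

As noted above, a write may still block even when the reported free space is
large enough for it, so the result is a hint rather than a guarantee.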


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
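
The convention can be illustrated with a small layered dispatcher. This is a
sketch, not kernel code: SKETCH_EPASSTHROUGH reuses the value -4 from the
message above, and the command numbers and handler names are hypothetical.
The final translation to ENOTTY is an assumption about how the top level
treats a command that no layer handled.

```c
#include <errno.h>

/* Value taken from the commit message; the name is local to this sketch. */
#define SKETCH_EPASSTHROUGH	(-4)

#define CMD_GET_UPPER	1	/* handled by the upper layer */
#define CMD_GET_LOWER	2	/* handled by the lower layer */

/* Lowest layer: anything it does not recognise is passed through. */
int
lower_ioctl(int cmd, int *data)
{
	if (cmd == CMD_GET_LOWER) {
		*data = 2;
		return 0;
	}
	return SKETCH_EPASSTHROUGH;
}

/* Upper layer: handle own commands, hand everything else down. */
int
upper_ioctl(int cmd, int *data)
{
	if (cmd == CMD_GET_UPPER) {
		*data = 1;
		return 0;
	}
	return lower_ioctl(cmd, data);
}

/* Top level: a command nobody handled becomes ENOTTY for the caller. */
int
ioctl_sketch(int cmd, int *data)
{
	int error = upper_ioctl(cmd, data);

	if (error == SKETCH_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Using a dedicated value keeps "not my command" distinct from ERESTART, which
is the confusion the change set out to remove.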


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
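
The two rules in this message (the 95% cap on the sum of the minimums, and
the per-type floor) can be sketched as follows. The struct and function names
are hypothetical, and the integer arithmetic is one possible way to express
the check, not necessarily what the pagedaemon does.

```c
#include <stdbool.h>

/* Percentages of pageable memory reserved for each usage type. */
struct page_mins {
	int anonmin;
	int vnodemin;
	int vtextmin;
};

#define MINS_SUM_LIMIT	95

/* The minimums must be non-negative and sum to at most 95%. */
bool
mins_valid(const struct page_mins *m)
{
	if (m->anonmin < 0 || m->vnodemin < 0 || m->vtextmin < 0)
		return false;
	return m->anonmin + m->vnodemin + m->vtextmin <= MINS_SUM_LIMIT;
}

/*
 * Would reusing one page of a given type keep that type at or above
 * its reserved share? Compares (type_pages - 1) / pageable_pages
 * with min_percent / 100 without using floating point.
 */
bool
reuse_allowed(long type_pages, long pageable_pages, int min_percent)
{
	return (type_pages - 1) * 100 >= (long)min_percent * pageable_pages;
}
```

For example, with 1000 pageable pages and a 10% minimum, a type holding 101
pages may give one up, while a type holding exactly 100 may not.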


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
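
A minimal userland sketch of the mapping described in this entry. The
DEMO_ constants, the demo_sync_request structure and both functions are
hypothetical stand-ins, not the kernel's definitions; the sketch only
models "which integrity mode was requested" as two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's IO_SYNC and IO_DSYNC ioflag bits. */
enum { DEMO_IO_SYNC = 0x1, DEMO_IO_DSYNC = 0x2 };

/* Hypothetical summary of the integrity modes requested for an open file. */
struct demo_sync_request {
	bool read_file_integrity;   /* file integrity completion for reads */
	bool write_data_integrity;  /* data integrity completion for writes */
};

/* ioflag that would be handed to the read vnode operator. */
int demo_read_ioflag(const struct demo_sync_request *req)
{
	assert(req != NULL);
	return req->read_file_integrity ? DEMO_IO_SYNC : 0;
}

/* ioflag that would be handed to the write vnode operator. */
int demo_write_ioflag(const struct demo_sync_request *req)
{
	assert(req != NULL);
	return req->write_data_integrity ? DEMO_IO_DSYNC : 0;
}
```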


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
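
The calling convention described here can be sketched in userland as
follows. demo_file, demo_read and DEMO_FOF_UPDATE_OFFSET are
hypothetical names used only for illustration; the real fileops
signatures are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: update the offset that was passed by reference. */
#define DEMO_FOF_UPDATE_OFFSET 0x01

/* Hypothetical in-memory "file". */
struct demo_file {
	const char *data;
	size_t size;
	long long offset;   /* the file's own position */
};

/*
 * Read up to len bytes starting at *offp.  The offset is advanced only
 * when the caller asks for it with DEMO_FOF_UPDATE_OFFSET.
 */
size_t demo_read(struct demo_file *fp, long long *offp, void *buf,
    size_t len, int flags)
{
	size_t start, n;

	assert(fp != NULL && offp != NULL && buf != NULL && *offp >= 0);
	if ((unsigned long long)*offp >= fp->size)
		return 0;
	start = (size_t)*offp;
	n = fp->size - start;
	if (n > len)
		n = len;
	memcpy(buf, fp->data + start, n);
	if (flags & DEMO_FOF_UPDATE_OFFSET)
		*offp += (long long)n;
	return n;
}
```

A read(2)-style caller would pass &fp->offset together with the update
flag, while a pread(2)-style caller would pass a private offset and no
flag, leaving the file position untouched.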


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.226 19-Mar-2022 hannken

Lock vnode across VOP_OPEN.


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com
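
One common way to avoid signed overflow when combining an offset with a
delta is to test against the limits before adding. The sketch below is
an illustration of that general technique, not the code from this
commit; demo_offset_add is a hypothetical name and the choice of
EOVERFLOW is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store base + delta in *result, or return EOVERFLOW (assumed errno)
 * if the sum does not fit in int64_t.  The addition is never performed
 * when it would overflow.
 */
int demo_offset_add(int64_t base, int64_t delta, int64_t *result)
{
	assert(result != NULL);
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	if (delta < 0 && base < INT64_MIN - delta)
		return EOVERFLOW;
	*result = base + delta;
	return 0;
}
```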


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.,
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).
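
The three-step semantics listed in this entry can be sketched in
userland as follows. demo_file, demo_seek and DEMO_FOF_UPDATE_OFFSET
are hypothetical stand-ins; locking is omitted, and the errno values
chosen for bad input are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical flag mirroring FOF_UPDATE_OFFSET. */
#define DEMO_FOF_UPDATE_OFFSET 0x01

/* Hypothetical file object: current offset and end-of-file position. */
struct demo_file {
	int64_t offset;
	int64_t size;
};

int demo_seek(struct demo_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	assert(fp != NULL && fp->offset >= 0 && fp->size >= 0);

	/* Step 1: compute the new offset from whence and delta. */
	switch (whence) {
	case SEEK_SET: base = 0; break;
	case SEEK_CUR: base = fp->offset; break;
	case SEEK_END: base = fp->size; break;
	default: return EINVAL;
	}
	/* base is never negative, so only the upper bound can overflow. */
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller wants it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* Step 3: commit it to the file only when asked to. */
	if (flags & DEMO_FOF_UPDATE_OFFSET)
		fp->offset = newoff;
	return 0;
}
```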


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)
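
The calling convention described in this entry (success yields either a
vnode or a file descriptor; callers that cannot handle a descriptor
pass NULL and get EOPNOTSUPP) might look roughly like this in a
userland mock. demo_open, demo_vnode, the path "/dev/demo-cloner" and
the descriptor number 7 are all invented for illustration; the real
vn_open signature is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vnode stand-in. */
struct demo_vnode { int id; };

static struct demo_vnode demo_regular_vnode = { 1 };

/*
 * On success, either *ret_vp is a vnode, or *ret_vp is NULL and
 * *ret_fd / *ret_domove describe a file descriptor.  ret_domove and
 * ret_fd must be both NULL or both non-NULL.
 */
int demo_open(const char *path, struct demo_vnode **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	assert(path != NULL && ret_vp != NULL);
	assert((ret_domove == NULL) == (ret_fd == NULL));

	/* Pretend this one path is a cloning device that hands back fd 7. */
	if (strcmp(path, "/dev/demo-cloner") == 0) {
		if (ret_fd == NULL)
			return EOPNOTSUPP;  /* caller cannot take a descriptor */
		*ret_vp = NULL;
		*ret_domove = true;
		*ret_fd = 7;
		return 0;
	}

	/* Regular case: only the vnode is returned. */
	*ret_vp = &demo_regular_vnode;
	return 0;
}
```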


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
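
A userland analogue of such an accessor, written with C11 atomics.
demo_vnode and demo_vrefcnt are hypothetical names, and C11
atomic_load_explicit() with memory_order_relaxed stands in for the
kernel's atomic_load_relaxed().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical vnode with only a use count. */
struct demo_vnode {
	atomic_uint usecount;
};

/* Hide the field and the memory ordering behind one function. */
unsigned int demo_vrefcnt(struct demo_vnode *vp)
{
	assert(vp != NULL);
	return atomic_load_explicit(&vp->usecount, memory_order_relaxed);
}
```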


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
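
The truncation hazard described in this entry can be reproduced in
userland. The sketch assumes a 32-bit unsigned int; demo_uimin,
DEMO_MIN and the two wrapper functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Function form: arguments are converted to unsigned int on the way in. */
unsigned int demo_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: operates on whatever type the operands already have. */
#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Taking the minimum of 64-bit values through the unsigned int function. */
uint64_t demo_min_via_uimin(uint64_t a, uint64_t b)
{
	/* Truncation only shows up where unsigned int is narrower than 64 bits. */
	assert(sizeof(unsigned int) < sizeof(uint64_t));
	return demo_uimin((unsigned int)a, (unsigned int)b);
}

/* Taking the minimum of 64-bit values through the macro. */
uint64_t demo_min_via_macro(uint64_t a, uint64_t b)
{
	return DEMO_MIN(a, b);
}
```

With a = 0x100000001 and b = 5, the function form compares the
truncated value 1 against 5, while the macro form compares the full
64-bit values.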


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code
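
A toy model of the rule stated in this entry: a dead vnode is handed
back locked only when the caller passed LK_RETRY. demo_vnode,
demo_vn_lock and DEMO_LK_RETRY are hypothetical, the "lock" is a plain
boolean, and returning ENOENT in the failure case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag mirroring LK_RETRY. */
#define DEMO_LK_RETRY 0x1

/* Hypothetical vnode: a lock bit and a "dead" marker. */
struct demo_vnode {
	bool locked;
	bool dead;
};

int demo_vn_lock(struct demo_vnode *vp, int flags)
{
	assert(vp != NULL && !vp->locked);

	vp->locked = true;              /* stand-in for taking the real lock */
	if (vp->dead && (flags & DEMO_LK_RETRY) == 0) {
		vp->locked = false;     /* do not hand out a dead vnode */
		return ENOENT;          /* assumed errno */
	}
	return 0;
}
```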


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
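
The technique named in this entry, sketched in userland: clear the
whole structure, padding included, before assigning individual fields.
demo_stat and demo_fill_stat are hypothetical; the field layout is
chosen only so that typical ABIs insert padding after the first member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure with padding between kind and dev on typical ABIs. */
struct demo_stat {
	uint8_t  kind;
	uint64_t dev;
	uint16_t mode;
};

void demo_fill_stat(struct demo_stat *sb, uint64_t dev, uint16_t mode)
{
	assert(sb != NULL);
	/* Zero everything first so stale bytes cannot survive in the gaps. */
	memset(sb, 0, sizeof(*sb));
	sb->kind = 1;
	sb->dev = dev;
	sb->mode = mode;
}
```

Without the memset(), whatever happened to be in the buffer beforehand
would remain in the padding bytes and be copied out along with the
fields.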


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
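
A small model of the rule in this entry. demo_offset_after_write is a
hypothetical helper that returns the file position after a write; it
assumes that a non-empty append write starts at end of file, and it
leaves out overflow handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

int64_t demo_offset_after_write(int64_t cur, int64_t file_size, size_t len,
    bool append)
{
	assert(cur >= 0 && file_size >= 0);
	if (len == 0)
		return cur;         /* nothing written: offset stays put */
	if (append)
		cur = file_size;    /* appends always start at end of file */
	return cur + (int64_t)len;
}
```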


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.
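
A minimal sketch of the idea in the first item, not the actual vn_open() code: the requested open flags are folded into one permission mask so that the access check runs once for O_RDWR instead of once per direction. All SK_* names and sk_access() are hypothetical stand-ins; the counter exists only to make the single call visible.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for FREAD/FWRITE and VREAD/VWRITE. */
#define SK_FREAD	0x1
#define SK_FWRITE	0x2
#define SK_VREAD	04
#define SK_VWRITE	02

/* Counts access checks; on NFS each one could cost an RPC. */
static int sk_access_calls;

/* Stand-in for VOP_ACCESS(): succeed only if every requested bit is granted. */
static int
sk_access(int granted, int mode)
{
	sk_access_calls++;
	return (granted & mode) == mode ? 0 : EACCES;
}

/* Build the whole mask first, then check it with a single call. */
static int
sk_open_check(int granted, int fmode)
{
	int mode = 0;

	assert((fmode & ~(SK_FREAD | SK_FWRITE)) == 0);
	if (fmode & SK_FREAD)
		mode |= SK_VREAD;
	if (fmode & SK_FWRITE)
		mode |= SK_VWRITE;
	return mode != 0 ? sk_access(granted, mode) : 0;
}
```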


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
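
The "try the emulation root first, then the real root" order can be modelled in userland as follows. This is only an analogy under assumptions: the real lookup works on vnodes inside namei() and does not prepend strings, and sk_emul_resolve() is a hypothetical helper that ignores the ".." and symlink rules listed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Resolve an absolute `path`: if it exists below `emul_root`, return the
 * prefixed name (stored in `buf`); otherwise fall back to `path` itself.
 * A NULL emul_root models a process without an emulation root.
 */
static const char *
sk_emul_resolve(const char *emul_root, const char *path,
    char *buf, size_t buflen)
{
	assert(path != NULL && buf != NULL);

	if (emul_root != NULL && path[0] == '/') {
		int n = snprintf(buf, buflen, "%s%s", emul_root, path);

		/* First attempt: inside the emulation root. */
		if (n > 0 && (size_t)n < buflen && access(buf, F_OK) == 0)
			return buf;
	}
	/* Second attempt: the normal root. */
	return path;
}
```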


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().
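
A userland analogy for the time-of-check/time-of-use problem, offered as a sketch rather than as the Veriexec code: checking a path and then opening it are two separate lookups, and the name can be rebound in between. Doing a single lookup and checking the object that was actually opened closes that window. sk_open_regular() is a hypothetical helper, and "is a regular file" stands in for whatever policy is being enforced.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `path` read-only and accept it only if it is a regular file.
 * The check uses fstat() on the descriptor, so it applies to the object
 * that was opened, not to whatever the name refers to later.
 * Returns a descriptor, or -1 on failure.
 */
static int
sk_open_regular(const char *path)
{
	struct stat st;
	int fd;

	assert(path != NULL);

	/* O_NONBLOCK keeps the open from hanging on a FIFO. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode)) {
		(void)close(fd);
		return -1;
	}
	return fd;
}
```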


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() with veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
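
The cumulative strict levels listed above can be summarised as a small table: each event has a lowest level at which it is denied, and every higher level inherits it. The sketch below is an illustration only. The enum names are hypothetical (the commit only numbers the levels 0-3), and the access-type enforcement of level 2 is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for strict levels 0-3. */
enum sk_strict { SK_LEARNING = 0, SK_IDS = 1, SK_IPS = 2, SK_LOCKDOWN = 3 };

enum sk_event {
	SK_FP_MISMATCH,		/* fingerprint does not match */
	SK_TABLE_MODIFY,	/* modification of the veriexec tables */
	SK_WRITE_MONITORED,	/* write access to a monitored file */
	SK_REMOVE_MONITORED,	/* removal of a monitored file */
	SK_CREATE_FILE,		/* creation of a new file */
	SK_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which each event is denied. */
static enum sk_strict
sk_min_level(enum sk_event ev)
{
	switch (ev) {
	case SK_FP_MISMATCH:
	case SK_TABLE_MODIFY:
		return SK_IDS;
	case SK_WRITE_MONITORED:
	case SK_REMOVE_MONITORED:
		return SK_IPS;
	case SK_CREATE_FILE:
	case SK_ACCESS_UNMONITORED:
		return SK_LOCKDOWN;
	}
	assert(!"unknown event");
	return SK_LOCKDOWN;
}

/* Level 0 never denies anything; it would only warn. */
static bool
sk_denied(enum sk_strict level, enum sk_event ev)
{
	return level >= sk_min_level(ev);
}
```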


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).
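
The mutual exclusion mentioned above can be pictured with a toy model: a vnode that is open for writing cannot become process text, and a vnode in use as text cannot be opened for writing. This is a sketch under assumptions, not the kernel code; struct sk_vnode and both helpers are hypothetical, and ETXTBSY is used as the conventional "text file busy" error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vnode: only the two fields involved in the exclusion. */
struct sk_vnode {
	int	writecount;	/* number of opens for writing */
	bool	text;		/* in use as process text */
};

/* Opening for write fails while the vnode is executing. */
static int
sk_open_write(struct sk_vnode *vp)
{
	assert(vp != NULL);
	if (vp->text)
		return ETXTBSY;
	vp->writecount++;
	return 0;
}

/* Marking as text fails while the vnode has writers. */
static int
sk_mark_text(struct sk_vnode *vp)
{
	assert(vp != NULL);
	if (vp->writecount > 0)
		return ETXTBSY;
	vp->text = true;
	return 0;
}
```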


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
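
A minimal model of the convention described above: each layer returns EPASSTHROUGH for a command it does not recognise, and only the outermost caller turns an unclaimed command into ENOTTY. The values -3 and -4 come from the commit message. The SK_* names, the command numbers, and the three functions are hypothetical, and the ENOTTY translation at the top level is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values as stated in the commit message. */
#define SK_ERESTART	(-3)
#define SK_EPASSTHROUGH	(-4)

/* Hypothetical command numbers, for illustration only. */
enum { SK_CMD_LOWER = 1, SK_CMD_UPPER = 2 };

/* Lower layer: handles only SK_CMD_LOWER. */
static int
sk_lower_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_LOWER:
		*data = 10;
		return 0;
	default:
		return SK_EPASSTHROUGH;	/* not ours; let the caller decide */
	}
}

/* Upper layer: handles SK_CMD_UPPER and passes everything else down. */
static int
sk_upper_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_UPPER:
		*data = 20;
		return 0;
	default:
		return sk_lower_ioctl(cmd, data);
	}
}

/* Outermost caller: a command nobody claimed becomes ENOTTY. */
static int
sk_sys_ioctl(int cmd, int *data)
{
	int error;

	assert(data != NULL);
	error = sk_upper_ioctl(cmd, data);
	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because "unhandled" no longer shares a value with ERESTART, a handler that really wants the call restarted can say so without being mistaken for one that ignored the command.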


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
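
The two rules in this message (a per-type minimum, and a 95% cap on the sum of the minimums) can be sketched as below. This is an illustrative model, not the pagedaemon code: struct sk_vm_min and both helpers are hypothetical, and the reuse test is one plausible reading of "would reduce the count of that type of page below the minimum".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three minimums, as percentages of pageable memory. */
struct sk_vm_min {
	int	anonmin;	/* anonymous data (vm.anonmin) */
	int	filemin;	/* cached files (vm.vnodemin) */
	int	execmin;	/* executable image (vm.vtextmin) */
};

/* The sum of the minimums may not exceed 95%. */
static bool
sk_min_valid(const struct sk_vm_min *m)
{
	assert(m != NULL);
	if (m->anonmin < 0 || m->filemin < 0 || m->execmin < 0)
		return false;
	return m->anonmin + m->filemin + m->execmin <= 95;
}

/*
 * May one page of a given type be reused?  Refuse when taking it away
 * would leave that type below its reserved share of pageable memory.
 */
static bool
sk_may_reuse(long type_pages, long pageable_pages, int min_percent)
{
	assert(type_pages >= 0 && pageable_pages > 0);
	if (type_pages == 0)
		return false;
	/* Compare (type_pages - 1) / pageable_pages with min_percent / 100. */
	return (type_pages - 1) * 100 >= (long)min_percent * pageable_pages;
}
```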


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.225 13-Mar-2022 riastradh

vfs(9): Avoid arithmetic overflow in vn_seek.

Reported-by: syzbot+b9f9a02148a40675c38a@syzkaller.appspotmail.com


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted, for example).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.
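
The interest tracking described in the third item of the first list above can be illustrated with a small user-space sketch. All names and flag values here (sk_vnode, sk_watch, sk_notify, SK_NOTE_*) are hypothetical stand-ins and not the kernel's klist or knote API. The sketch only shows the idea of keeping a combined mask so that a notification nobody asked for returns before any list traversal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event bits; the real NOTE_* values live in <sys/event.h>. */
#define SK_NOTE_DELETE 0x01u
#define SK_NOTE_WRITE  0x02u
#define SK_NOTE_OPEN   0x04u

#define SK_MAXWATCH 8

struct sk_vnode {
    uint32_t interest;              /* OR of every watcher's event mask */
    uint32_t watch[SK_MAXWATCH];    /* events each watcher asked for */
    uint32_t pending[SK_MAXWATCH];  /* events delivered to each watcher */
    size_t   nwatch;                /* number of registered watchers */
    unsigned walks;                 /* times the watcher list was traversed */
};

/* Register a watcher and fold its events into the combined mask. */
static int
sk_watch(struct sk_vnode *vp, uint32_t events)
{
    if (vp->nwatch == SK_MAXWATCH)
        return -1;
    vp->watch[vp->nwatch] = events;
    vp->pending[vp->nwatch] = 0;
    vp->nwatch++;
    vp->interest |= events;
    return 0;
}

/* Deliver an event, skipping all work if no watcher wants it. */
static void
sk_notify(struct sk_vnode *vp, uint32_t event)
{
    if ((vp->interest & event) == 0)
        return;     /* fast path: no locking, no traversal */

    /* A real implementation would take the relevant lock here. */
    vp->walks++;
    for (size_t i = 0; i < vp->nwatch; i++)
        vp->pending[i] |= vp->watch[i] & event;
}
```

In this sketch a watcher registered only for SK_NOTE_DELETE causes sk_notify(vp, SK_NOTE_WRITE) to return without touching the watcher list.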


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.
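
A minimal user-space sketch of the three steps listed above, with SEEK_SET as the "use delta as is" case. Everything named sketch_* or SKETCH_* is hypothetical: struct sketch_file stands in for struct file, SKETCH_UPDATE_OFFSET stands in for FOF_UPDATE_OFFSET, and the choice of EOVERFLOW and EINVAL as error codes is an assumption rather than a statement of what the kernel returns. No locking is shown; in the kernel the object lock would be held throughout.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

#define SKETCH_UPDATE_OFFSET 0x01  /* stand-in for FOF_UPDATE_OFFSET */

/* Hypothetical stand-in for struct file; only the fields used here. */
struct sketch_file {
    int64_t f_offset;  /* current offset */
    int64_t f_size;    /* end of file, normally asked of the object */
};

static int
sketch_seek(struct sketch_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
    int64_t base, newoff;

    /* Step 1: pick the base that delta is relative to. */
    switch (whence) {
    case SEEK_SET: base = 0;            break;
    case SEEK_CUR: base = fp->f_offset; break;
    case SEEK_END: base = fp->f_size;   break;
    default:       return EINVAL;
    }
    if (base < 0)
        return EINVAL;

    /* Check before adding, so base + delta never overflows int64_t. */
    if (delta > 0 && base > INT64_MAX - delta)
        return EOVERFLOW;
    newoff = base + delta;
    if (newoff < 0)
        return EINVAL;

    /* Step 2: report the new offset if the caller asked for it. */
    if (newoffp != NULL)
        *newoffp = newoff;

    /* Step 3: commit it to the file only when requested. */
    if (flags & SKETCH_UPDATE_OFFSET)
        fp->f_offset = newoff;
    return 0;
}
```

Passing flags without the update bit gives a "compute only" call, which in this sketch leaves f_offset unchanged.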

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.
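
The difference between the two kinds of helper can be shown with a short sketch. The names uimin_sketch and MIN_SKETCH are hypothetical copies written for illustration, not the libkern definitions, and the example assumes a platform where unsigned int is 32 bits wide. The explicit casts spell out the conversion that the compiler would otherwise perform silently at the call.

```c
#include <stdint.h>

/* Same shape as a min() defined on unsigned int only. */
static inline unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Type-preserving macro of the kind some subsystems define locally. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/* Clamp a 64-bit length using the unsigned int function. */
static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
    /* Both arguments lose their upper 32 bits before the comparison. */
    return uimin_sketch((unsigned int)len, (unsigned int)limit);
}

/* Clamp a 64-bit length using the macro; no conversion takes place. */
static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
    return MIN_SKETCH(len, limit);
}
```

For len = 0x100000001 and limit = 16, the macro version yields 16, while the function version sees len as 1 after truncation and yields 1.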

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
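
The leak and its fix can be sketched in user space. The struct below is a hypothetical three-field stand-in for struct stat, laid out so that common 64-bit ABIs place padding after st_mode and at the end; fill_stat and stale_bytes are illustrative helpers, not kernel functions. The sketch relies on the usual compiler behaviour that storing to a scalar member leaves padding bytes alone.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in: a 64-bit field between two 32-bit fields. */
struct sketch_stat {
    uint32_t st_mode;
    uint64_t st_dev;    /* 64-bit dev_t forces alignment padding before it */
    uint32_t st_nlink;
};

/* Zero the whole structure first, then fill in the fields. */
static void
fill_stat(struct sketch_stat *sb, uint32_t mode, uint64_t dev, uint32_t nlink)
{
    memset(sb, 0, sizeof(*sb));  /* clears padding as well as fields */
    sb->st_mode = mode;
    sb->st_dev = dev;
    sb->st_nlink = nlink;
}

/* Count bytes in buf still equal to a known "stale" pattern. */
static size_t
stale_bytes(const void *buf, size_t len, unsigned char pattern)
{
    const unsigned char *p = buf;
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (p[i] == pattern)
            n++;
    return n;
}
```

If the buffer is pre-filled with 0xAA to imitate old stack contents, stale_bytes() reports none left after fill_stat(). Without the memset, the padding bytes would still carry the old pattern when the structure is copied out.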


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
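
The behaviour described above can be observed from userland. The following is a minimal sketch, not the kernel change itself: the helper name offset_after_zero_append_write() and the temporary file template are made up for illustration, and the check relies only on standard POSIX calls.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file, switch the descriptor to
 * O_APPEND, seek to `start`, write zero bytes, and report where the
 * file offset ends up.  Returns -1 if any setup step fails.
 */
off_t
offset_after_zero_append_write(off_t start)
{
	char path[] = "/tmp/append-demo-XXXXXX";
	off_t result = -1;
	int fd, flags;

	assert(start >= 0);

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	(void)unlink(path);		/* file disappears once fd is closed */

	flags = fcntl(fd, F_GETFL);
	if (write(fd, "hello", 5) == 5 &&
	    flags != -1 &&
	    fcntl(fd, F_SETFL, flags | O_APPEND) != -1 &&
	    lseek(fd, start, SEEK_SET) == start &&
	    write(fd, "", 0) == 0)	/* the zero-length append */
		result = lseek(fd, 0, SEEK_CUR);

	(void)close(fd);
	return result;
}
```

With the fix in place, a call such as offset_after_zero_append_write(2) is expected to return 2 rather than the end-of-file position 5.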


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.
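
One way to picture the first point: fold the open flags into a single access mask and make one access check instead of one per flag. The sketch below is an illustration only. The X_* constants, check_open_access() and counting_check() are hypothetical stand-ins, not the kernel's FREAD/FWRITE/VREAD/VWRITE definitions or the actual vn_open() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for open flags and access-mode bits. */
enum { X_FREAD = 0x1, X_FWRITE = 0x2 };
enum { X_VREAD = 0x4, X_VWRITE = 0x2 };

/* An access check; on NFS each call would cost an RPC. */
typedef int (*access_fn)(int mode, void *cookie);

/* Translate open flags into one combined access mask. */
int
access_mode_for_open(int fmode)
{
	int mode = 0;

	if (fmode & X_FREAD)
		mode |= X_VREAD;
	if (fmode & X_FWRITE)
		mode |= X_VWRITE;
	return mode;
}

/* Perform at most one access check, whatever the open flags are. */
int
check_open_access(int fmode, access_fn check, void *cookie)
{
	int mode = access_mode_for_open(fmode);

	assert(check != NULL);
	if (mode == 0)
		return 0;
	return check(mode, cookie);
}

/* Demo checker: always grants access and counts how often it ran. */
int
counting_check(int mode, void *cookie)
{
	(void)mode;
	++*(int *)cookie;
	return 0;
}
```

Opening with X_FREAD | X_FWRITE then triggers a single call to the checker instead of two.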


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
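
The two-step lookup described above (emulation root first, then the normal root) can be sketched in userland as plain string handling. This is a simplified model under stated assumptions, not namei(): emul_lookup_sketch(), exists_fn and demo_exists() are hypothetical names, and the handling of "..", symlinks and the root-substitution rule is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Predicate standing in for "does this path resolve?". */
typedef int (*exists_fn)(const char *path);

/*
 * Resolve an absolute path the way the message describes: try
 * "<emul_root><path>" first and fall back to "<path>" in the real root.
 * The chosen path is written to `out`.  Returns 0 on success, -1 if
 * neither candidate exists or the buffer is too small.
 */
int
emul_lookup_sketch(const char *emul_root, const char *path,
    exists_fn exists, char *out, size_t outlen)
{
	int n;

	assert(path != NULL && exists != NULL && out != NULL);

	if (emul_root != NULL && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n > 0 && (size_t)n < outlen && exists(out))
			return 0;	/* found inside the emulation root */
	}

	n = snprintf(out, outlen, "%s", path);
	if (n < 0 || (size_t)n >= outlen || !exists(out))
		return -1;
	return 0;			/* found in the normal root */
}

/* Demo file system: one file in the emulation root, one in the real root. */
int
demo_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}
```

With demo_exists(), looking up "/lib/libc.so" under the emulation root "/emul/linux" yields "/emul/linux/lib/libc.so", while "/etc/passwd" falls back to the real root.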


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() with veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
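
The cumulative strict levels listed above lend themselves to a small lookup: each restricted event has a minimum level at which it starts being denied. The sketch below only models that list. The enum names and strict_level_denies() are hypothetical and are not the identifiers used in the Veriexec sources.

```c
#include <assert.h>
#include <stdbool.h>

/* The four strict levels from the list above. */
enum { STRICT_LEARNING = 0, STRICT_IDS = 1, STRICT_IPS = 2, STRICT_LOCKDOWN = 3 };

/* Events the policy may deny (names are illustrative). */
enum vx_event {
	EV_FP_MISMATCH,		/* fingerprint does not match */
	EV_TABLE_MODIFY,	/* modification of veriexec tables */
	EV_WRITE_MONITORED,	/* write access to a monitored file */
	EV_REMOVE_MONITORED,	/* removal of a monitored file */
	EV_WRONG_ACCESS_TYPE,	/* 'direct', 'indirect' or 'file' violated */
	EV_CREATE_FILE,		/* creation of a new file */
	EV_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which an event is denied. */
static int
min_level_for(enum vx_event ev)
{
	switch (ev) {
	case EV_FP_MISMATCH:
	case EV_TABLE_MODIFY:
		return STRICT_IDS;
	case EV_WRITE_MONITORED:
	case EV_REMOVE_MONITORED:
	case EV_WRONG_ACCESS_TYPE:
		return STRICT_IPS;
	case EV_CREATE_FILE:
	case EV_ACCESS_UNMONITORED:
		return STRICT_LOCKDOWN;
	}
	return STRICT_LOCKDOWN;
}

/* Higher levels imply everything the lower ones deny. */
bool
strict_level_denies(int level, enum vx_event ev)
{
	assert(level >= STRICT_LEARNING && level <= STRICT_LOCKDOWN);
	return level >= min_level_for(ev);
}
```

Level 0 denies nothing in this model, which matches the "warnings only" description of learning mode.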


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
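
A rough sketch of the convention: each layer returns a dedicated "not mine" code for commands it does not handle, and only the outermost dispatcher turns that into ENOTTY. The numeric values -3 and -4 come from the message above. Everything else here (the SK_ prefix, dispatch_ioctl(), the demo layers, and the final translation to ENOTTY) is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values as stated in the commit message; the names are sketch-local. */
#define SK_ERESTART	(-3)
#define SK_EPASSTHROUGH	(-4)

/* One handler layer: 0 on success, an errno, or SK_EPASSTHROUGH. */
typedef int (*ioctl_layer_fn)(unsigned long cmd, void *data);

/*
 * Offer the command to each layer in turn.  A layer that does not
 * recognise it returns SK_EPASSTHROUGH and the next one is tried.
 * Any other result, including SK_ERESTART, is returned unchanged.
 */
int
dispatch_ioctl(const ioctl_layer_fn *layers, size_t nlayers,
    unsigned long cmd, void *data)
{
	size_t i;

	assert(layers != NULL || nlayers == 0);
	for (i = 0; i < nlayers; i++) {
		int error = layers[i](cmd, data);

		if (error != SK_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;		/* nobody handled the command */
}

/* Demo layers with made-up command numbers. */
int
tty_layer(unsigned long cmd, void *data)
{
	if (cmd == 1) {
		*(int *)data = 10;
		return 0;
	}
	return SK_EPASSTHROUGH;
}

int
dev_layer(unsigned long cmd, void *data)
{
	if (cmd == 2) {
		*(int *)data = 20;
		return 0;
	}
	if (cmd == 3)
		return SK_ERESTART;	/* distinct from "unhandled" */
	return SK_EPASSTHROUGH;
}
```

Because "unhandled" and "restart" no longer share the value -1, the dispatcher can tell them apart, which is the confusion the change removes.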


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
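
The balancing rule described above can be sketched in user-space C. This is
a minimal model, not the UVM code: the names usage_model, mins_valid and
may_reclaim are hypothetical, and the exact comparison and rounding used by
the kernel are assumptions.

```c
#include <stdbool.h>

/* The three usage types named in the commit message. */
enum page_kind { KIND_ANON, KIND_EXEC, KIND_FILE, KIND_COUNT };

/* Hypothetical bookkeeping; all counts are in pages. */
struct usage_model {
	unsigned long pageable;            /* total pageable memory */
	unsigned long inuse[KIND_COUNT];   /* pages in use per type */
	unsigned int  min_pct[KIND_COUNT]; /* reserved minimum, in percent */
};

/* The minimums may not sum to more than 95% of pageable memory. */
static bool
mins_valid(const unsigned int min_pct[KIND_COUNT])
{
	unsigned int sum = 0;

	for (int i = 0; i < KIND_COUNT; i++)
		sum += min_pct[i];
	return sum <= 95;
}

/*
 * A page of the given type may be reused only if the count left
 * afterwards stays at or above that type's minimum.
 */
static bool
may_reclaim(const struct usage_model *m, enum page_kind kind)
{
	if (m->inuse[kind] == 0)
		return false;
	return (m->inuse[kind] - 1) * 100 >= m->pageable * m->min_pct[kind];
}
```

Capping the sum of the minimums at 95% means at least 5% of pageable memory
is never reserved for any one type, so something is always reclaimable.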


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.224 20-Oct-2021 thorpej

Overhaul of the EVFILT_VNODE kevent(2) filter:

- Centralize vnode kevent handling in the VOP_*() wrappers, rather than
forcing each individual file system to deal with it (except VOP_RENAME(),
because VOP_RENAME() is a mess and we currently have 2 different ways
of handling it; at least it's reasonably well-centralized in the "new"
way).
- Add support for NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, and NOTE_READ,
compatible with the same events in FreeBSD.
- Track which kevent notifications clients are interested in receiving
to avoid doing work for events no one cares about (avoiding, e.g.,
taking locks and traversing the klist to send a NOTE_WRITE when
someone is merely watching for a file to be deleted).

In support of the above:

- Add support in vnode_if.sh for specifying PRE- and POST-op handlers,
to be invoked before and after vop_pre() and vop_post(), respectively.
Basic idea from FreeBSD, but implemented differently.
- Add support in vnode_if.sh for specifying CONTEXT fields in the
vop_*_args structures. These context fields are used to convey information
between the file system VOP function and the VOP wrapper, but do not
occupy an argument slot in the VOP_*() call itself. These context fields
are initialized and subsequently interpreted by PRE- and POST-op handlers.
- Version VOP_REMOVE(), using a context field for the file system to report
back the resulting link count of the target vnode. Return this in tmpfs,
udf, nfs, chfs, ext2fs, lfs, and ufs.

NetBSD 9.99.92.


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.
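
The three steps above can be sketched as a user-space model. This is not the
kernel's fo_seek: struct sk_file, sk_seek and SK_UPDATE_OFFSET are
hypothetical stand-ins (the last one for FOF_UPDATE_OFFSET), SEEK_SET is
taken in its usual lseek(2) sense of an absolute offset, the error values are
assumptions, and the object lock that the real code must hold is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

#define SK_UPDATE_OFFSET 0x01	/* hypothetical stand-in for FOF_UPDATE_OFFSET */

/* Hypothetical file object: current offset and size, both non-negative. */
struct sk_file {
	int64_t f_offset;
	int64_t f_size;
};

static int
sk_seek(struct sk_file *fp, int64_t delta, int whence, int64_t *newoffp,
    int flags)
{
	int64_t base, newoff;

	/* Step 1: pick the base that delta is relative to. */
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	default:
		return EINVAL;
	}
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;	/* base + delta would overflow */
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* Step 3: commit the new offset only when requested. */
	if (flags & SK_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Keeping steps 2 and 3 separate lets a caller validate or compute an offset
without moving the file position.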

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.
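
The calling convention described in the paragraph above can be modelled
roughly as follows. Everything here is hypothetical (model_vnode,
model_lookup, model_open); it only illustrates the "vnode or file descriptor"
result and the EOPNOTSUPP fallback, and it is not the real vn_open signature.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_vnode {
	int id;
};

/* What the lower layers produced for this open (hypothetical). */
struct model_lookup {
	struct model_vnode *vp;	/* non-NULL for an ordinary open */
	int fd;			/* meaningful only when vp == NULL */
	bool domove;		/* true for the EMOVEFD case, false for EDUPFD */
};

/*
 * Return 0 with either *ret_vp set, or *ret_vp == NULL and
 * *ret_fd / *ret_domove set.  Callers that cannot cope with a file
 * descriptor pass NULL for both extra return values.
 */
static int
model_open(const struct model_lookup *res, struct model_vnode **ret_vp,
    bool *ret_domove, int *ret_fd)
{
	if (res->vp != NULL) {
		*ret_vp = res->vp;
		return 0;
	}
	if (ret_domove == NULL || ret_fd == NULL)
		return EOPNOTSUPP;	/* caller not prepared for an fd */
	*ret_vp = NULL;
	*ret_domove = res->domove;
	*ret_fd = res->fd;
	return 0;
}
```

In this model an in-kernel caller that only wants a vnode passes NULL twice
and gets a clean error instead of a magic errno it would have to decode.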

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.
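
A small sketch of the difference, assuming a platform where unsigned int is
32 bits wide. sketch_uimin and SKETCH_MIN are illustrative names written for
this example, not the libkern definitions themselves.

```c
#include <stdint.h>

/* Same shape as the helper described above: defined on unsigned int. */
static unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: compares in the operands' own type, so no truncation. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	/* Both arguments are implicitly converted to unsigned int here. */
	return sketch_uimin(len, limit);
}

static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	return SKETCH_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 0x200000000, the function form sees the
truncated values 1 and 0 and returns 0, while the macro form returns
0x100000001. The macro does evaluate one of its arguments twice, which is the
multiple-evaluation hazard mentioned above.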

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
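
The pattern can be illustrated with a user-space sketch. struct sketch_stat
and sketch_stat_fill are hypothetical and much smaller than the real struct
stat; the field layout is chosen only so that typical ABIs insert padding
around the 64-bit member.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical structure with padding bytes on typical ABIs. */
struct sketch_stat {
	uint32_t st_mode;
	uint64_t st_dev;	/* 64-bit member forces alignment padding */
	uint16_t st_nlink;
};

static void
sketch_stat_fill(struct sketch_stat *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
	/*
	 * Zero the whole object first, so bytes that no member covers
	 * (padding) do not carry stale memory out to the consumer.
	 */
	memset(sb, 0, sizeof(*sb));
	sb->st_mode = mode;
	sb->st_dev = dev;
	sb->st_nlink = nlink;
}
```

Assigning each member individually is not enough, because the copy to the
caller transfers sizeof(struct) bytes, padding included. In practice the
zeroed padding survives the member stores, which is what the pattern relies
on.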


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
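
The intended behaviour can be illustrated with a toy model of a file
offset. The structure and function names below are made up for the
example and are not taken from the kernel; the point is only that the
"seek to end of file" step of an append must be skipped when nothing is
written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an open file; the names are illustrative only. */
struct toy_file {
	int64_t size;	/* current file size */
	int64_t offset;	/* current file offset */
	int	append;	/* nonzero if opened with O_APPEND */
};

/* Model a write of len bytes; returns the number of bytes written. */
static size_t
toy_write(struct toy_file *fp, size_t len)
{
	assert(fp != NULL);

	/* A zero-length write leaves the offset alone, even with O_APPEND. */
	if (len == 0)
		return 0;

	if (fp->append)
		fp->offset = fp->size;	/* appends always start at EOF */
	fp->offset += (int64_t)len;
	if (fp->offset > fp->size)
		fp->size = fp->offset;
	return len;
}
```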


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
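
The "try the emulation root first, then retry from the normal root"
rule for absolute pathnames can be sketched in plain C as below. This
is not the namei() code: toy_lookup(), TOY_TRYEMULROOT, the predicate
and the /emul/linux path are all hypothetical, and the sketch ignores
"..", symlinks and the rule about returning the real root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_TRYEMULROOT	0x1	/* illustrative flag value, not the kernel's */

/* Caller-supplied predicate: does this path exist? */
typedef int (*toy_exists_fn)(const char *);

/* Example predicate: pretend only one file lives in the emulation root. */
static int
toy_exists_demo(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0;
}

/*
 * Resolve `path` into `out`.  For absolute paths with TOY_TRYEMULROOT
 * set and an emulation root configured, try the emulation root first
 * and fall back to the normal root if nothing is found there.
 * Returns 0 on success, -1 if the result does not fit in `out`.
 */
static int
toy_lookup(const char *emulroot, int flags, const char *path,
    toy_exists_fn exists, char *out, size_t outlen)
{
	int n;

	assert(path != NULL && exists != NULL && out != NULL);

	if (emulroot != NULL && (flags & TOY_TRYEMULROOT) && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n < 0 || (size_t)n >= outlen)
			return -1;
		if (exists(out))
			return 0;
	}

	/* Retry from the normal root. */
	n = snprintf(out, outlen, "%s", path);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Without the flag, or when the file is absent from the emulation root,
the path comes back unchanged.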


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this into two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec underwent some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
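
As a rough illustration of how the four levels stack, the access rules
listed above could be written as one decision function. This is a
sketch with made-up names, not the veriexec code, and it leaves out
removal checks, table modification and access-type enforcement.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names; the levels and their meaning follow the list above. */
enum toy_strict_level {
	TOY_STRICT_LEARNING = 0,	/* warnings only */
	TOY_STRICT_IDS = 1,
	TOY_STRICT_IPS = 2,
	TOY_STRICT_LOCKDOWN = 3
};

/*
 * Should an open be denied?  `monitored` says whether the file has an
 * entry, `fp_match` whether its fingerprint matched (only meaningful
 * for monitored files), `want_write` whether write access is requested.
 */
static bool
toy_deny_access(enum toy_strict_level level, bool monitored, bool fp_match,
    bool want_write)
{
	assert((int)level >= 0 && (int)level <= 3);

	if (!monitored) {
		/* Level 3: deny access to non-monitored files. */
		return level >= TOY_STRICT_LOCKDOWN;
	}

	/* Level 1 and up: deny access on fingerprint mismatch. */
	if (!fp_match && level >= TOY_STRICT_IDS)
		return true;

	/* Level 2 and up: deny write access to monitored files. */
	if (want_write && level >= TOY_STRICT_IPS)
		return true;

	return false;
}
```

Each level only adds checks, which is what "all implications of strict
level N" amounts to in the list.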


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
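
The relationship between the two queries can be shown with a toy send
queue. The structure and function names are invented for the example
and do not mirror the kernel's socket buffer; the sketch only shows why
FIONSPACE spares the caller from knowing the queue size.

```c
#include <assert.h>
#include <stddef.h>

/* Toy send queue; field names are illustrative, not the kernel's. */
struct toy_sendq {
	size_t size;	/* capacity of the send queue in bytes */
	size_t used;	/* bytes currently queued */
};

/* FIONWRITE-style query: number of bytes in the send queue. */
static size_t
toy_fionwrite(const struct toy_sendq *q)
{
	assert(q != NULL);
	return q->used;
}

/* FIONSPACE-style query: number of free bytes in the send queue. */
static size_t
toy_fionspace(const struct toy_sendq *q)
{
	assert(q != NULL);
	/* Clamp to zero rather than underflow if the queue is overcommitted. */
	return q->used >= q->size ? 0 : q->size - q->used;
}
```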


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
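
The two rules in this message (the three minimums may not sum to more than 95%, and a page is not reused if that would push its type below the minimum) can be sketched roughly as follows. This is an illustration only, not the UVM code: the function names, argument types, and the exact rounding are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical check for the three percentages (vm.anonmin, vm.vnodemin,
 * vm.vtextmin): each must be a valid percentage and the sum must leave
 * at least 5% of pageable memory reusable.
 */
static int
sketch_check_minimums(int anonmin, int vnodemin, int vtextmin)
{
	if (anonmin < 0 || anonmin > 100 ||
	    vnodemin < 0 || vnodemin > 100 ||
	    vtextmin < 0 || vtextmin > 100)
		return EINVAL;
	if (anonmin + vnodemin + vtextmin > 95)
		return EINVAL;
	return 0;
}

/*
 * Hypothetical pagedaemon test: may one page of a given type be reused
 * without dropping that type below minpct percent of pageable memory?
 */
static bool
sketch_may_reuse(unsigned long count, unsigned long total, int minpct)
{
	return count > 0 &&
	    (count - 1) * 100 >= total * (unsigned long)minpct;
}
```

With 1000 pageable pages and a 10% minimum, the sketch allows reuse while more than 100 pages of that type are in use and refuses once the count reaches 100.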


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.223 11-Sep-2021 riastradh

sys/kern: Avoid fp->f_offset without the object (here, vnode) lock.


# 1.222 11-Sep-2021 riastradh

sys/kern: Allow custom fileops to specify fo_seek method.

Previously only vnodes allowed lseek/pread[v]/pwrite[v], which meant
converting a regular device to a cloning device doesn't always work.

Semantics is:

(*fp->f_ops->fo_seek)(fp, delta, whence, newoffp, flags)

1. Compute a new offset according to whence + delta -- that is, if
whence is SEEK_CUR, add delta to fp->f_offset; if whence is
SEEK_END, add delta to end of file; if whence is SEEK_SET, use delta
as is.

2. If newoffp is nonnull, return the new offset in *newoffp.

3. If flags & FOF_UPDATE_OFFSET, set fp->f_offset to the new offset.

Access to fp->f_offset, and *newoffp if newoffp = &fp->f_offset, must
happen under the object lock (e.g., vnode lock), in order to
synchronize fp->f_offset reads and writes.
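
The three steps above can be sketched in plain C as follows. This is a userland model, not the kernel routine: struct sketch_file, sketch_seek(), and the value of SKETCH_FOF_UPDATE_OFFSET are hypothetical, and the object lock that the caller must hold is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END, NULL */

#define SKETCH_FOF_UPDATE_OFFSET 0x1	/* hypothetical flag value */

struct sketch_file {
	int64_t f_offset;	/* current offset, assumed nonnegative */
	int64_t f_size;		/* stands in for the end of file */
};

static int
sketch_seek(struct sketch_file *fp, int64_t delta, int whence,
    int64_t *newoffp, int flags)
{
	int64_t base, newoff;

	/* Step 1: pick the base that delta is applied to. */
	switch (whence) {
	case SEEK_CUR:
		base = fp->f_offset;
		break;
	case SEEK_END:
		base = fp->f_size;
		break;
	case SEEK_SET:
		base = 0;
		break;
	default:
		return EINVAL;
	}
	if (delta > 0 && base > INT64_MAX - delta)
		return EOVERFLOW;
	newoff = base + delta;
	if (newoff < 0)
		return EINVAL;

	/* Step 2: report the new offset if the caller asked for it. */
	if (newoffp != NULL)
		*newoffp = newoff;

	/* Step 3: commit the new offset only when requested. */
	if (flags & SKETCH_FOF_UPDATE_OFFSET)
		fp->f_offset = newoff;
	return 0;
}
```

Passing flags of 0 gives a "compute but do not move" call, which is the shape a pread-style caller would want; passing newoffp as NULL with the update flag set moves the offset without reporting it.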

This change has the side effect that every call to VOP_SEEK happens
under the vnode lock now, when previously it didn't. However, from a
review of all the VOP_SEEK implementations, it does not appear that
any file system even examines the vnode, let alone locks it. So I
think this is safe -- and essentially the only reasonable way to do
things, given that it is used to validate a change from oldoff to
newoff, and oldoff becomes stale the moment we unlock the vnode.

No kernel bump because this reuses a spare entry in struct fileops,
and it is safe for the entry to be null, so all existing fileops will
continue to work as before (rejecting seek).


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.
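
A toy model of the containment logic described above, under stated assumptions: the function name, its arguments, and the numeric values of the two magic errnos are all hypothetical, and the assertion that the two extra return slots are passed together is an assumption rather than something this message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the internal negative errnos. */
#define SKETCH_EDUPFD	(-1001)
#define SKETCH_EMOVEFD	(-1002)

/*
 * lower_error is what the underlying open code reported; pending_fd is
 * the descriptor it stashed away when it "failed" with a magic errno.
 */
static int
sketch_open_result(int lower_error, int pending_fd, bool *ret_domove,
    int *ret_fd)
{
	/* Assumed API rule: both extra return slots or neither. */
	assert((ret_domove == NULL) == (ret_fd == NULL));

	if (lower_error != SKETCH_EDUPFD && lower_error != SKETCH_EMOVEFD)
		return lower_error;	/* success (0) or an ordinary errno */

	if (ret_fd == NULL)
		return EOPNOTSUPP;	/* caller cannot accept a descriptor */

	*ret_fd = pending_fd;
	*ret_domove = (lower_error == SKETCH_EMOVEFD);
	return 0;			/* the magic errno stops here */
}
```

The point of the design is that the negative values never leave the function: a caller sees 0, an ordinary errno, or EOPNOTSUPP.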

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
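
A minimal sketch of such an accessor using C11 atomics in userland. The struct and function names here are hypothetical, and the types are not claimed to match the real kernel signature.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the relevant part of struct vnode. */
struct sketch_vnode {
	atomic_uint v_usecount;
};

/*
 * Hide the field behind a function so callers do not depend on how the
 * count is stored. A relaxed load gives a snapshot with no ordering
 * guarantees, which suits diagnostics and heuristics.
 */
static unsigned int
sketch_vrefcnt(struct sketch_vnode *vp)
{
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```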


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.
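
The difference can be seen in a small userland sketch, assuming a platform where unsigned int is 32 bits wide. The names are hypothetical stand-ins, not the libkern definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper defined on unsigned int only. */
static unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Stand-in for a type-generic macro; it evaluates its arguments twice. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

static uint64_t
sketch_clamp_truncating(uint64_t len, uint64_t limit)
{
	/* Both arguments are silently converted to unsigned int here. */
	return sketch_uimin(len, limit);
}

static uint64_t
sketch_clamp_generic(uint64_t len, uint64_t limit)
{
	/* The comparison happens in uint64_t, so nothing is truncated. */
	return SKETCH_MIN(len, limit);
}
```

For len = 0x100000001 and limit = 5, the function version compares 1 with 5 and yields 1, while the macro version yields 5.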

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
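
The pattern described here, sketched in userland C with a hypothetical structure (not the real struct stat layout): clear the whole object first, so that any padding the compiler inserts between or after members does not carry stale memory out to the consumer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; compilers typically pad after st_mode and st_nlink. */
struct sketch_stat {
	uint32_t st_mode;
	uint64_t st_dev;
	uint16_t st_nlink;
};

static void
sketch_fill_stat(struct sketch_stat *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
	/* Zero everything, padding included, before setting the members. */
	memset(sb, 0, sizeof(*sb));
	sb->st_mode = mode;
	sb->st_dev = dev;
	sb->st_nlink = nlink;
}
```

Assigning the members one by one without the memset would leave the padding bytes holding whatever was previously in that memory.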


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(). 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
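
The behaviour can be observed from userland with a small probe along
these lines. This is only a sketch: the helper names and the temporary
file template are made up for illustration, and only standard POSIX
calls are used.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the file offset seen after a zero-length write on an
 * O_APPEND descriptor whose offset was first reset to 0, or -1 on
 * any error. Hypothetical helper, not part of any library.
 */
off_t
offset_after_empty_append_write(void)
{
	char path[] = "/tmp/append0.XXXXXX";
	off_t off = -1;
	int fd, fl;

	if ((fd = mkstemp(path)) == -1)
		return -1;
	unlink(path);	/* the file lives only as long as the descriptor */

	/* Switch the descriptor to append mode. */
	if ((fl = fcntl(fd, F_GETFL)) == -1 ||
	    fcntl(fd, F_SETFL, fl | O_APPEND) == -1)
		goto out;

	/* Give the file some contents, then rewind the offset. */
	if (write(fd, "hello", 5) != 5 || lseek(fd, 0, SEEK_SET) != 0)
		goto out;

	/* A zero-length write is expected to leave the offset alone. */
	if (write(fd, "", 0) != 0)
		goto out;
	off = lseek(fd, 0, SEEK_CUR);
out:
	close(fd);
	return off;
}

/* Usage: the offset should still be 0, not the file size (5). */
void
check_empty_append_write(void)
{
	assert(offset_after_empty_append_write() == 0);
}
```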


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
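
The two-step lookup described above can be sketched in userland as
follows. This is a simplified model, not namei(): every name in it is
hypothetical, the "file system" is a hard-coded predicate, and the
".." and symlink rules are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a file system lookup. */
typedef bool (*exists_fn)(const char *path);

/*
 * Resolve `path` into `out`. Absolute paths are tried under
 * `emul_root` first; if that fails, the lookup is retried from the
 * normal root. Returns false if neither lookup succeeds or the
 * result does not fit in `out`.
 */
bool
tryemulroot_sketch(const char *emul_root, const char *path,
    exists_fn exists, char *out, size_t outlen)
{
	assert(path != NULL && out != NULL && outlen > 0);

	if (emul_root != NULL && path[0] == '/') {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);

		if (n >= 0 && (size_t)n < outlen && exists(out))
			return true;
	}
	if (strlen(path) >= outlen || !exists(path))
		return false;
	strcpy(out, path);
	return true;
}

/* Toy "file system" used only for demonstration. */
bool
demo_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}

/* Resolve against the toy file system; NULL if nothing is found. */
const char *
demo_resolve(const char *path)
{
	static char buf[256];

	if (!tryemulroot_sketch("/emul/linux", path, demo_exists,
	    buf, sizeof(buf)))
		return NULL;
	return buf;
}
```

Here demo_resolve("/lib/libc.so") is found under the emulation root,
while demo_resolve("/etc/passwd") falls back to the normal root.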


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this into two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
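
The cumulative nature of the four levels can be sketched as a small
table. The enum and function names below are hypothetical and only
restate the policy listed above (access-type enforcement is left out);
this is not the Veriexec code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the levels are numbered as in the list above. */
enum strict_level {
	STRICT_LEARNING = 0,
	STRICT_IDS = 1,
	STRICT_IPS = 2,
	STRICT_LOCKDOWN = 3
};

enum action {
	ACT_FP_MISMATCH,	/* access with a fingerprint mismatch */
	ACT_MODIFY_TABLES,	/* modify the veriexec tables */
	ACT_WRITE_MONITORED,	/* write to a monitored file */
	ACT_REMOVE_MONITORED,	/* remove a monitored file */
	ACT_CREATE_FILE,	/* create a new file */
	ACT_ACCESS_UNMONITORED,	/* access a non-monitored file */
	ACT_COUNT
};

/* Lowest strict level at which each action is denied. */
static const enum strict_level min_deny_level[ACT_COUNT] = {
	[ACT_FP_MISMATCH] = STRICT_IDS,
	[ACT_MODIFY_TABLES] = STRICT_IDS,
	[ACT_WRITE_MONITORED] = STRICT_IPS,
	[ACT_REMOVE_MONITORED] = STRICT_IPS,
	[ACT_CREATE_FILE] = STRICT_LOCKDOWN,
	[ACT_ACCESS_UNMONITORED] = STRICT_LOCKDOWN
};

/* Each level implies everything denied by the levels below it. */
bool
level_denies(enum strict_level level, enum action act)
{
	assert((unsigned)act < ACT_COUNT);
	return level >= min_deny_level[act];
}
```

At level 0 nothing is denied, which matches the "warnings only"
learning mode.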


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
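
The relationship between the two values can be sketched with a toy
send queue. The struct and function names are hypothetical and this is
a model only, not the kernel's socket buffer code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a send queue. */
struct sendq {
	size_t capacity;	/* maximum bytes the queue may hold */
	size_t used;		/* bytes currently queued */
};

/* What FIONWRITE is described as reporting: bytes in the queue. */
size_t
sendq_nwrite(const struct sendq *q)
{
	return q->used;
}

/* What FIONSPACE is described as reporting: free bytes in the queue. */
size_t
sendq_nspace(const struct sendq *q)
{
	assert(q->used <= q->capacity);
	return q->capacity - q->used;
}
```

In this model the two values always add up to the queue size, which is
why a program that has FIONSPACE does not need to know that size.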


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
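
A minimal sketch of how a distinct "not handled here" code can be used
when ioctl handlers are layered. The handler names and command numbers
are hypothetical, the K_ prefix avoids clashing with host errno
definitions, and the final conversion to ENOTTY at the top level is an
assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Kernel-internal pseudo-errnos, using the values given above. */
#define K_ERESTART	(-3)
#define K_EPASSTHROUGH	(-4)

typedef int (*ioctl_handler_t)(unsigned long cmd, void *data);

/* Hypothetical upper layer: handles command 1 only. */
int
upper_ioctl(unsigned long cmd, void *data)
{
	if (cmd == 1) {
		*(int *)data = 100;
		return 0;
	}
	return K_EPASSTHROUGH;	/* not ours, let the next layer try */
}

/* Hypothetical lower layer: handles command 2 only. */
int
lower_ioctl(unsigned long cmd, void *data)
{
	if (cmd == 2) {
		*(int *)data = 200;
		return 0;
	}
	return K_EPASSTHROUGH;
}

/*
 * Try each layer in order. The first result other than
 * K_EPASSTHROUGH wins; if no layer claims the command, report
 * ENOTTY to the caller.
 */
int
dispatch_ioctl(unsigned long cmd, void *data)
{
	static const ioctl_handler_t layers[] = { upper_ioctl, lower_ioctl };
	size_t i;

	assert(data != NULL);
	for (i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
		int error = layers[i](cmd, data);

		if (error != K_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;
}
```

Because the pass-through code has its own value, it cannot be mistaken
for a restart request or for a real error from a lower layer.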


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
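
A minimal userspace sketch of the calling convention described above: the
read routine takes a pointer to the offset to use plus a flags word, and
only advances that offset when asked. Everything here (struct demo_file,
demo_read, DEMO_UPDATE_OFFSET) is a hypothetical illustration, not the
kernel's fileops code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_UPDATE_OFFSET 0x01	/* hypothetical flag name */

/* Hypothetical in-memory "file" with its own current position. */
struct demo_file {
	const char *df_data;
	size_t      df_size;
	size_t      df_offset;
};

/*
 * Read up to len bytes starting at *offset. The offset passed by
 * reference is advanced only if DEMO_UPDATE_OFFSET is set in flags.
 * Returns the number of bytes copied.
 */
size_t
demo_read(struct demo_file *fp, size_t *offset, void *buf, size_t len,
    int flags)
{
	size_t avail, n;

	assert(fp != NULL && offset != NULL && buf != NULL);
	avail = *offset < fp->df_size ? fp->df_size - *offset : 0;
	n = len < avail ? len : avail;
	if (n > 0)
		memcpy(buf, fp->df_data + *offset, n);
	if (flags & DEMO_UPDATE_OFFSET)
		*offset += n;
	return n;
}
```

A positioned read would pass a private offset with flags 0, while an
ordinary sequential read would pass &fp->df_offset together with
DEMO_UPDATE_OFFSET.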


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)
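
From the caller's side, the point above means the ioctl argument should be
an int. A small userspace sketch, assuming a system where <sys/ioctl.h>
provides FIONREAD; demo_bytes_ready is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the number of bytes ready to read on fd, or -1 on error.
 * The value is stored through an int pointer, not an off_t or ssize_t.
 */
int
demo_bytes_ready(int fd)
{
	int n = 0;

	assert(fd >= 0);
	if (ioctl(fd, FIONREAD, &n) == -1)
		return -1;
	return n;
}
```

Passing the address of a wider type here would leave part of the variable
unwritten, which is why the argument type matters.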


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.221 18-Jul-2021 dholland

Fix confusion arising from whether FOLLOW or NOFOLLOW is 0.

In vn_open, don't set and then throw away FOLLOW, and clarify the
comment about requesting FOLLOW/NOFOLLOW behavior.

Related to PR 56316.


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


Revision tags: thorpej-i2c-spi-conf-base
# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.
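
The user-visible distinction this commit relies on can be sketched from
userspace: O_CREAT alone opens an existing target, while O_CREAT|O_EXCL
fails with EEXIST. The sketch below uses an ordinary temporary file, not
the root or mount-point case the commit fixes, and the helper names are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Open path with O_CREAT plus extra_flags; return 0 or the errno. */
static int
demo_try_open(const char *path, int extra_flags)
{
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | extra_flags, 0600);
	if (fd == -1)
		return errno;
	close(fd);
	return 0;
}

/*
 * Create a temporary file, then reopen it with and without O_EXCL.
 * Stores the two results and returns 0, or returns -1 if the
 * temporary file could not be created.
 */
int
demo_creat_semantics(int *plain_err, int *excl_err)
{
	char path[] = "/tmp/demo_creat.XXXXXX";
	int fd;

	assert(plain_err != NULL && excl_err != NULL);
	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	close(fd);
	*plain_err = demo_try_open(path, 0);	/* target exists: opens it */
	*excl_err = demo_try_open(path, O_EXCL);	/* target exists: EEXIST */
	unlink(path);
	return 0;
}
```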


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
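
The accessor pattern described above can be sketched in userspace with C11
atomics. struct demo_obj and demo_refcnt are hypothetical stand-ins; the
kernel's struct vnode and vrefcnt() are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object carrying a use count. */
struct demo_obj {
	atomic_uint do_usecount;
};

/*
 * Accessor that hides both the field name and the memory ordering,
 * so callers never touch do_usecount directly.
 */
unsigned int
demo_refcnt(struct demo_obj *obj)
{
	assert(obj != NULL);
	return atomic_load_explicit(&obj->do_usecount, memory_order_relaxed);
}
```

Routing every read through one function means the representation of the
count can change later without touching the callers.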


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
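
The truncation hazard described above can be illustrated with a small
userspace sketch. demo_uimin is a hypothetical stand-in for a min function
defined on unsigned int, and DEMO_MIN is the macro form; neither is the
libkern code. The sketch assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) == 4,
    "demo assumes a 32-bit unsigned int");

/* Hypothetical stand-in for a min() defined on unsigned int. */
unsigned int
demo_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: no truncation, but arguments may be evaluated twice. */
#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Clamp a 64-bit length using the function form. Both arguments are
 * implicitly converted to unsigned int, so the upper 32 bits are
 * silently discarded before the comparison.
 */
uint64_t
demo_clamp_truncating(uint64_t len, uint64_t limit)
{
	return demo_uimin(len, limit);
}

/* Clamp using the macro form: the comparison happens in 64 bits. */
uint64_t
demo_clamp_exact(uint64_t len, uint64_t limit)
{
	return DEMO_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 16, the function form compares 1
against 16, while the macro form compares the full 64-bit values.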


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
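
One reading of the ordering above is lock coupling: take the child's lock
while still holding the parent's, and only then release the parent. A
userspace sketch with pthread mutexes, assuming that reading; struct
demo_dirnode and demo_step_down are hypothetical and do not reproduce the
vnode locking code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical tree node with its own lock. */
struct demo_dirnode {
	pthread_mutex_t      dn_lock;
	struct demo_dirnode *dn_child;
};

/*
 * Step from a locked parent to its child.
 * Order: the parent is already locked, lock the child, unlock the parent.
 * On return the parent is unlocked and the child (if any) is locked.
 */
struct demo_dirnode *
demo_step_down(struct demo_dirnode *parent)
{
	struct demo_dirnode *child;

	assert(parent != NULL);
	child = parent->dn_child;	/* read while the parent is locked */
	if (child != NULL)
		pthread_mutex_lock(&child->dn_lock);
	pthread_mutex_unlock(&parent->dn_lock);
	return child;
}
```

Holding the parent until the child is locked leaves no window in which
neither node is protected.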


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
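
The fix pattern above, clearing the whole structure before filling in its
members, can be sketched as follows. struct demo_stat is a hypothetical
record laid out so that common ABIs insert padding after the first member;
it is not struct stat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: padding typically follows ds_type. */
struct demo_stat {
	uint8_t  ds_type;
	uint64_t ds_dev;
};

/*
 * Fill in *ds for copying out to an untrusted consumer. The memset
 * clears the padding bytes as well as the members, so no stale
 * memory contents remain in the gaps.
 */
void
demo_fill(struct demo_stat *ds, uint8_t type, uint64_t dev)
{
	assert(ds != NULL);
	memset(ds, 0, sizeof(*ds));
	ds->ds_type = type;
	ds->ds_dev = dev;
}
```

Without the memset, whatever bytes previously occupied the padding would be
copied out along with the real fields.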


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
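
The behaviour described above can be observed from userland. The following is a minimal sketch, not code from this revision: probe_zero_append_write() is a hypothetical helper, and the temporary path under /tmp is an assumption. It records the file offset before and after a zero-length write(2) on a descriptor that has O_APPEND set.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 on success and stores the file offset
 * seen before and after a zero-length write on an O_APPEND descriptor.
 */
int
probe_zero_append_write(off_t *before, off_t *after)
{
	char path[] = "/tmp/zero_append_XXXXXX";	/* assumed scratch location */
	int fd, flags, rc = -1;

	if ((fd = mkstemp(path)) == -1)
		return -1;
	unlink(path);		/* the file lives only as long as fd */

	flags = fcntl(fd, F_GETFL);
	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_APPEND) == -1)
		goto out;
	if (write(fd, "hello", 5) != 5)		/* make the file non-empty */
		goto out;
	if ((*before = lseek(fd, 0, SEEK_SET)) == -1)
		goto out;
	if (write(fd, "", 0) != 0)		/* the zero-length write */
		goto out;
	if ((*after = lseek(fd, 0, SEEK_CUR)) == -1)
		goto out;
	rc = 0;
out:
	close(fd);
	return rc;
}
```

With the fix in place, a caller should see the same value in both offsets. A kernel without the fix would be expected to report the end-of-file position in the second one.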


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
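
The two-step lookup described in this entry can be modelled in userland. This is only a sketch under stated assumptions: emul_resolve() is a hypothetical helper, it uses access(2) rather than namei(), and it ignores the ".." and symlink rules listed above.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Userland model only; emul_resolve() is not a kernel interface.
 * Absolute paths are tried under emul_root first and then from the
 * real root; relative paths are looked up unchanged.
 * Returns 0 and fills 'out' on success, -1 if no lookup succeeds.
 */
int
emul_resolve(const char *emul_root, const char *path, char *out, size_t outlen)
{
	int n;

	if (path[0] == '/' && emul_root != NULL) {
		n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n > 0 && (size_t)n < outlen && access(out, F_OK) == 0)
			return 0;	/* found inside the emulation root */
	}
	n = snprintf(out, outlen, "%s", path);
	if (n < 0 || (size_t)n >= outlen || access(out, F_OK) != 0)
		return -1;
	return 0;			/* fell back to the normal root */
}
```

The point of the design is that the caller passes the original path untouched, so no scratch space (the old 'stackgap') is needed to build an alternative pathname before the lookup.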


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
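
The four strict levels are cumulative, so the policy can be written as a threshold per event. The sketch below is illustrative only: the enum names and strict_denies() are assumptions, not the kernel's identifiers, and the access-type enforcement of level 2 is left out.

```c
#include <stdbool.h>

/* Illustrative names; the kernel's own identifiers may differ. */
enum strict_level {
	STRICT_LEARNING = 0,
	STRICT_IDS = 1,
	STRICT_IPS = 2,
	STRICT_LOCKDOWN = 3
};

enum vx_event {
	EV_FP_MISMATCH,		/* access with a fingerprint mismatch */
	EV_TABLE_MODIFY,	/* modification of veriexec tables */
	EV_WRITE_MONITORED,	/* write access to a monitored file */
	EV_REMOVE_MONITORED,	/* removal of a monitored file */
	EV_CREATE_FILE,		/* creation of a new file */
	EV_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which each event is denied, per the list above. */
static enum strict_level
deny_threshold(enum vx_event ev)
{
	switch (ev) {
	case EV_FP_MISMATCH:
	case EV_TABLE_MODIFY:
		return STRICT_IDS;
	case EV_WRITE_MONITORED:
	case EV_REMOVE_MONITORED:
		return STRICT_IPS;
	case EV_CREATE_FILE:
	case EV_ACCESS_UNMONITORED:
		return STRICT_LOCKDOWN;
	}
	return STRICT_LOCKDOWN;
}

/* Level 0 only warns, so nothing is ever denied there. */
bool
strict_denies(enum strict_level level, enum vx_event ev)
{
	return level >= deny_threshold(ev);
}
```

Because each level implies everything below it, a single comparison is enough and no per-level table is needed.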


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
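
A small model may help show why a dedicated "not handled" value matters. The numeric values below are the ones given in this entry. Everything else is an assumption made for illustration: the MODEL_ prefixes, CMD_KNOWN, both functions, and the idea that the top layer turns an unclaimed command into ENOTTY.

```c
#include <errno.h>

/* Values from the log entry; prefixed so they cannot clash with real headers. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

#define CMD_KNOWN	1	/* hypothetical command number */

/* Lower layer: handles one command and passes everything else through. */
static int
model_lower_ioctl(unsigned long cmd, int *data)
{
	if (cmd == CMD_KNOWN) {
		*data = 42;
		return 0;
	}
	return MODEL_EPASSTHROUGH;	/* "not mine", distinct from ERESTART */
}

/* Top layer: assumed to report an unclaimed command as ENOTTY. */
int
model_ioctl(unsigned long cmd, int *data)
{
	int error = model_lower_ioctl(cmd, data);

	if (error == MODEL_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

With -1 no longer doubling as "unhandled", a layer that really wants the system call restarted can return ERESTART without being mistaken for one that did not recognise the command.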


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.
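
A minimal sketch of the "must be the first element" rule for struct genfs_node described above. The struct and function names here are hypothetical stand-ins, not the real genfs definitions; the point is only that generic code can treat the filesystem-specific data pointer as a pointer to the shared part when that part sits at offset zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared data the generic code needs. */
struct demo_genfs_node {
	int dgn_flags;
};

/* Hypothetical filesystem-specific node: the shared part comes first. */
struct demo_fs_node {
	struct demo_genfs_node dfn_gnode;	/* must stay the first member */
	long dfn_inumber;
};

/* Fail the build if someone moves the shared part away from offset 0. */
_Static_assert(offsetof(struct demo_fs_node, dfn_gnode) == 0,
    "shared node must be the first member");

/* Generic code only sees an opaque pointer to the per-vnode data. */
static struct demo_genfs_node *
demo_to_genfs(void *v_data)
{
	assert(v_data != NULL);
	/* Valid because a pointer to a struct also points to its first member. */
	return (struct demo_genfs_node *)v_data;
}

/* Example of a generic routine that never learns the filesystem's type. */
static int
demo_get_flags(void *v_data)
{
	return demo_to_genfs(v_data)->dgn_flags;
}
```

Calling demo_get_flags() with a pointer to a struct demo_fs_node reads the embedded flags without the generic code knowing anything else about the filesystem's node layout.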


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
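
A rough sketch of what an offset passed by reference plus an "update" flag makes possible. Everything here is hypothetical (the in-memory file, demo_read() and DEMO_UPDATE_OFFSET are illustration only, not the fileops interface); it only shows how one routine can serve both a caller that wants the file offset advanced and a caller that supplies its own position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: advance the offset passed by reference. */
#define DEMO_UPDATE_OFFSET 0x01

/* Hypothetical in-memory "file" with its own current offset. */
struct demo_file {
	const char *df_data;
	size_t df_size;
	off_t df_offset;
};

/* Read up to len bytes starting at *offset; returns the bytes copied. */
static size_t
demo_read(struct demo_file *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	assert(fp != NULL && offset != NULL && buf != NULL);
	if (*offset < 0 || (size_t)*offset >= fp->df_size)
		return 0;

	size_t avail = fp->df_size - (size_t)*offset;
	size_t n = len < avail ? len : avail;
	memcpy(buf, fp->df_data + *offset, n);

	/* Only move the caller's offset when explicitly asked to. */
	if (flags & DEMO_UPDATE_OFFSET)
		*offset += (off_t)n;
	return n;
}
```

A sequential reader would pass &fp->df_offset together with DEMO_UPDATE_OFFSET, while a positioned reader would pass a local copy of the position and no flag, leaving the file's own offset untouched.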


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.220 01-Jul-2021 martin

gcc (with some options) erroneously claims we would use "vp" uninitialized,
so initialize it as NULL.


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


Revision tags: thorpej-i2c-spi-conf-base
# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
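
The truncation hazard described above can be reproduced in a few lines. This is a sketch with local stand-ins (demo_uimin, demo_truncating_min and DEMO_MIN are hypothetical names, not the libkern definitions), and it assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min function defined on unsigned int. */
static unsigned int
demo_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * What happens when 64-bit values reach such a function.  The casts spell
 * out the conversion the compiler would otherwise apply silently at the
 * call site: the upper 32 bits are discarded before the comparison.
 */
static uint64_t
demo_truncating_min(uint64_t a, uint64_t b)
{
	return demo_uimin((unsigned int)a, (unsigned int)b);
}

/*
 * Macro form: compares in the operands' own type, so nothing is truncated,
 * but each argument may be evaluated more than once.
 */
#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))
```

For the pair 0x100000005 and 16, the function form yields 5 because the larger value is cut down to its low 32 bits first, while the macro form yields 16.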


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
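
A sketch of the two approaches the message compares. demo_open_regular() is a hypothetical helper: where the headers define O_REGULAR it lets the kernel refuse non-regular files, and otherwise it falls back to the older open-then-fstat idiom, which still opens the object before checking it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return a read-only descriptor for a regular file, or -1 otherwise. */
static int
demo_open_regular(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The open itself fails for anything that is not a regular file. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	/*
	 * Fallback idiom: O_NONBLOCK avoids hanging on FIFOs and some
	 * devices, but the object has already been opened by the time
	 * its type is checked, so device side effects can still occur.
	 */
	int fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;

	struct stat sb;
	if (fstat(fd, &sb) == -1 || !S_ISREG(sb.st_mode)) {
		close(fd);
		return -1;
	}
	return fd;
#endif
}
```

Either branch rejects a directory such as "/" or a device node such as /dev/null; only the O_REGULAR branch does so without first opening the device.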


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
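
A minimal sketch of the zero-then-fill pattern the message describes. struct demo_stat and demo_fill_stat() are hypothetical; the layout is chosen so that typical ABIs insert padding after the 32-bit members, which is where stale memory would otherwise survive a member-by-member fill.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record whose members are not tightly packed. */
struct demo_stat {
	uint32_t ds_mode;	/* usually followed by 4 bytes of padding */
	uint64_t ds_dev;	/* 64-bit member forces 8-byte alignment */
	uint32_t ds_nlink;	/* usually followed by tail padding */
};

/* Fill a record that will later be copied out as raw bytes. */
static void
demo_fill_stat(struct demo_stat *sb, uint32_t mode, uint64_t dev,
    uint32_t nlink)
{
	assert(sb != NULL);
	/* Clear everything first so padding holds no stale data. */
	memset(sb, 0, sizeof(*sb));
	sb->ds_mode = mode;
	sb->ds_dev = dev;
	sb->ds_nlink = nlink;
}
```

Assigning the members alone would leave whatever bytes were previously in the padding, and a byte-wise copy of the whole structure to another protection domain would carry them along.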


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers that lacked it and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
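
The four strict levels listed above can be read as a small decision table. The sketch below encodes only what the list states; the enum and function names are hypothetical and are not the kernel's actual Veriexec interface.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels described above. */
enum demo_strict {
	DEMO_LEARNING = 0,	/* warnings only */
	DEMO_IDS = 1,		/* deny on fingerprint mismatch */
	DEMO_IPS = 2,		/* also protect monitored files */
	DEMO_LOCKDOWN = 3	/* also deny anything unmonitored */
};

/* Access is denied on a mismatch from level 1, and for
 * non-monitored files at level 3. */
bool
demo_deny_access(enum demo_strict level, bool monitored, bool fp_matches)
{
	if (monitored)
		return level >= DEMO_IDS && !fp_matches;
	return level >= DEMO_LOCKDOWN;
}

/* Write access to monitored files is denied from level 2. */
bool
demo_deny_write(enum demo_strict level, bool monitored)
{
	return monitored && level >= DEMO_IPS;
}

/* Creation of new files is prevented at level 3 only. */
bool
demo_deny_create(enum demo_strict level)
{
	return level >= DEMO_LOCKDOWN;
}
```

Because each level includes the implications of the one below it, every check reduces to a single ordered comparison on the level.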


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - not always the data will be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
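
A minimal sketch of the convention described here. The two negative values are the ones given in this entry; the DEMO_* names, the command number, and the translation to ENOTTY at the system call boundary are assumptions made for the illustration and are not taken from the source.

```c
#include <errno.h>

/* Values as stated in the log entry; the names are hypothetical. */
#define DEMO_ERESTART     (-3)
#define DEMO_EPASSTHROUGH (-4)

#define DEMO_CMD_KNOWN 1UL	/* hypothetical command number */

/* A lower layer handles the commands it knows and passes the rest up. */
int
demo_driver_ioctl(unsigned long cmd)
{
	switch (cmd) {
	case DEMO_CMD_KNOWN:
		return 0;
	default:
		return DEMO_EPASSTHROUGH;	/* "not mine", never -1 */
	}
}

/*
 * Top-level dispatcher. Assumption: a command nobody handled is
 * reported to the caller as ENOTTY.
 */
int
demo_sys_ioctl(unsigned long cmd)
{
	int error = demo_driver_ioctl(cmd);

	if (error == DEMO_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Keeping the "unhandled" code distinct from the restart code is what lets the dispatcher tell the two situations apart.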


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
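
The 95% rule can be sketched as a simple validation step. The function name and the per-value range check are assumptions for the illustration; only the cap on the sum comes from the entry above.

```c
#include <stdbool.h>

/* Upper bound on the sum of the three minimums, per the log entry. */
#define DEMO_MAX_MIN_SUM 95

/*
 * Returns true if the three percentages (anonymous data, cached
 * files, executable image) form an acceptable set of minimums.
 */
bool
demo_minimums_valid(int anonmin, int filemin, int execmin)
{
	/* Assumed sanity check: each value is a percentage. */
	if (anonmin < 0 || filemin < 0 || execmin < 0)
		return false;
	if (anonmin > 100 || filemin > 100 || execmin > 100)
		return false;
	/* Leave at least 5% of pageable memory reusable. */
	return anonmin + filemin + execmin <= DEMO_MAX_MIN_SUM;
}
```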


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
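
A rough model of the two rules above, assuming placeholder bit values: the SK_* constants and function names are invented for illustration and do not match the kernel's real file or I/O flag definitions.

```c
#include <assert.h>

/* Placeholder bit values, for illustration only. */
#define SK_REQ_RSYNC	0x01	/* file-integrity completion of reads requested */
#define SK_REQ_DSYNC	0x02	/* data-integrity completion of writes requested */
#define SK_IO_SYNC	0x10	/* stands in for IO_SYNC */
#define SK_IO_DSYNC	0x20	/* stands in for IO_DSYNC */

#define SK_REQ_MASK	(SK_REQ_RSYNC | SK_REQ_DSYNC)

/* Compute the ioflag handed to the read vnode operation. */
static int
read_ioflag_sketch(int fflag)
{
	int ioflag = 0;

	assert((fflag & ~SK_REQ_MASK) == 0);
	if (fflag & SK_REQ_RSYNC)
		ioflag |= SK_IO_SYNC;
	return ioflag;
}

/* Compute the ioflag handed to the write vnode operation. */
static int
write_ioflag_sketch(int fflag)
{
	int ioflag = 0;

	assert((fflag & ~SK_REQ_MASK) == 0);
	if (fflag & SK_REQ_DSYNC)
		ioflag |= SK_IO_DSYNC;
	return ioflag;
}
```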


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.219 01-Jul-2021 christos

don't clear the error before we use it to determine if we are moving or duping.


# 1.218 30-Jun-2021 dholland

Improve Christos's vn_open fix.

- assert about api misuse up front (suggested by riastradh)
- restore the behavior of returning EOPNOTSUPP if ret_fd is NULL and we
get a fd back (otherwise things like ktruss -o /dev/stderr panic)
- clear error to 0 for the EDUPFD and EMOVEFD cases so opening a
cloner succeeds


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


Revision tags: thorpej-i2c-spi-conf-base
# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
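
As a userland illustration of an accessor that hides a relaxed atomic load, here is a minimal sketch using C11 <stdatomic.h>. The struct layout and the _sketch names are assumptions; the kernel uses its own atomic_load_relaxed() primitive and its own struct vnode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct vnode. */
struct vnode_sketch {
	atomic_uint v_usecount;
};

/*
 * Read the use count without imposing any ordering on surrounding
 * memory accesses.  Callers get a snapshot that may already be stale.
 */
static inline unsigned int
vrefcnt_sketch(struct vnode_sketch *vp)
{
	assert(vp != NULL);
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```

Routing every read through one function means the representation of the count can change later without touching each call site.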


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
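
The hazard described above can be reproduced in userland. This is a sketch with hypothetical _sketch names, assuming unsigned int is narrower than 64 bits (true on the common ILP32 and LP64 ABIs): a function taking unsigned int silently drops the high bits of a 64-bit argument, while a type-preserving macro does not.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for a min() that is defined on unsigned int only. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro; evaluates its arguments more than once. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/* Both arguments are narrowed to unsigned int before comparison. */
static uint64_t
clamp_truncating(uint64_t len, uint64_t limit)
{
	/* The hazard only exists when unsigned int is narrower. */
	assert(sizeof(unsigned int) < sizeof(uint64_t));
	return uimin_sketch((unsigned int)len, (unsigned int)limit);
}

/* The comparison happens at full 64-bit width. */
static uint64_t
clamp_preserving(uint64_t len, uint64_t limit)
{
	return MIN_SKETCH(len, limit);
}
```

The explicit casts in clamp_truncating() only make visible what an implicit conversion at the call site would do anyway.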


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
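
A hedged userland sketch of the two approaches compared above. The helper name is hypothetical. Where the headers define O_REGULAR the kernel performs the check; the fallback branch is the open(, O_NONBLOCK), fstat()+S_ISREG() idiom, which has already opened the object by the time it checks the type. The EINVAL in the fallback is an arbitrary choice for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path read-only, but only if it names a regular file.
 * Returns a file descriptor, or -1 with errno set.
 */
static int
open_regular_sketch(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The kernel refuses non-regular files before opening them. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	struct stat st;
	int fd;

	/* O_NONBLOCK avoids blocking on FIFOs, but the open still happens. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* arbitrary errno for this sketch */
		return -1;
	}
	return fd;
#endif
}
```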


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
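
The leak comes from padding: once a struct's members are no longer tightly packed, the gaps between them keep whatever bytes were previously in that memory unless the whole structure is cleared first. A minimal sketch with a hypothetical struct (not the real struct stat), assuming a typical ABI where the 64-bit member forces padding after the 32-bit one:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout with padding after the first and last members. */
struct stat_sketch {
	uint32_t sk_mode;
	uint64_t sk_dev;	/* 64-bit member forces alignment padding */
	uint16_t sk_nlink;
};

/*
 * Zero the entire structure, padding included, then fill in the
 * members.  Without the memset() the padding bytes would carry over
 * stale memory contents to whoever receives a copy of the struct.
 */
static void
fill_stat_sketch(struct stat_sketch *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->sk_mode = mode;
	sb->sk_dev = dev;
	sb->sk_nlink = nlink;
}
```

Strictly speaking, the C standard leaves padding contents unspecified after a member store; the sketch relies on the behavior of common compilers, which leave the zeroed padding alone.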


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers that lacked it and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
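
The lookup order described above can be modelled in user space. The sketch
below is only an illustration under stated assumptions: lookup_path(),
path_exists() and the NULL-terminated table of paths are hypothetical
stand-ins, not namei() itself, and the ".." and symlink rules are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a file system: a NULL-terminated path list. */
static bool
path_exists(const char *const *fs, const char *path)
{
	for (size_t i = 0; fs[i] != NULL; i++) {
		if (strcmp(fs[i], path) == 0)
			return true;
	}
	return false;
}

/*
 * Resolve "path" the way the commit message describes: if an emulation
 * root is known and the caller asked for it (the TRYEMULROOT idea), try
 * absolute pathnames under the emulation root first, then retry from
 * the normal root. The path that matched is written to "out".
 */
static bool
lookup_path(const char *const *fs, const char *emulroot, bool tryemulroot,
    const char *path, char *out, size_t outlen)
{
	if (tryemulroot && emulroot != NULL && path[0] == '/') {
		snprintf(out, outlen, "%s%s", emulroot, path);
		if (path_exists(fs, out))
			return true;
	}
	snprintf(out, outlen, "%s", path);
	return path_exists(fs, out);
}
```

In this model a caller that does not want emulation lookups simply passes
false for tryemulroot, which mirrors leaving TRYEMULROOT out of the flags
given to NDINIT().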


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
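
The cumulative nature of the four strict levels can be sketched as a small
policy table. This is an illustration only: the enum names and the
veriexec_denies() helper are hypothetical, not the kernel's identifiers, and
the level 2 "enforce access type" rule is not modelled.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels listed above. */
enum strict_level {
	LEVEL_LEARNING = 0,	/* warnings only */
	LEVEL_IDS = 1,
	LEVEL_IPS = 2,
	LEVEL_LOCKDOWN = 3
};

/* Hypothetical event kinds, one per policy item in the list. */
enum vx_event {
	EV_FP_MISMATCH,		/* access with a fingerprint mismatch */
	EV_TABLE_MODIFY,	/* modification of veriexec tables */
	EV_WRITE_MONITORED,	/* write access to a monitored file */
	EV_REMOVE_MONITORED,	/* removal of a monitored file */
	EV_CREATE_FILE,		/* creation of a new file */
	EV_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which the event is denied. */
static int
min_deny_level(enum vx_event ev)
{
	switch (ev) {
	case EV_FP_MISMATCH:
	case EV_TABLE_MODIFY:
		return LEVEL_IDS;
	case EV_WRITE_MONITORED:
	case EV_REMOVE_MONITORED:
		return LEVEL_IPS;
	case EV_CREATE_FILE:
	case EV_ACCESS_UNMONITORED:
		return LEVEL_LOCKDOWN;
	}
	return LEVEL_LOCKDOWN + 1;	/* unknown events are never denied */
}

/* Each level implies all lower ones, so a single comparison suffices. */
static bool
veriexec_denies(int level, enum vx_event ev)
{
	return level >= min_deny_level(ev);
}
```

At level 0 nothing is denied in this model, which matches the "warnings
only" description of learning mode.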


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
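
The relationship between the two values can be shown with a toy send
queue. This is a sketch under assumptions, not the kernel's socket buffer
code: struct sendq and both helper functions are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical model of a send queue. */
struct sendq {
	size_t size;	/* total capacity of the queue, in bytes */
	size_t used;	/* bytes currently queued for transmission */
};

/* What FIONWRITE is described as reporting: bytes in the send queue. */
static size_t
sendq_nwrite(const struct sendq *q)
{
	return q->used;
}

/* What FIONSPACE is described as reporting: free bytes in the queue. */
static size_t
sendq_nspace(const struct sendq *q)
{
	return q->used >= q->size ? 0 : q->size - q->used;
}
```

A caller that only has the FIONWRITE value would also need the queue size
to compute the free space; FIONSPACE removes that requirement. As the
message notes, enough free space does not guarantee that a write will not
block.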


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
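
A rough model of the convention: each layer returns a dedicated
"not mine" value for commands it does not handle, and only the top level
turns that into ENOTTY. The numeric values -3 and -4 come from the message
above; the K_ prefixed macro names, the two layer functions and
dispatch_ioctl() are hypothetical and exist only for this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Values as stated in the commit message; names are local to this sketch. */
#define K_ERESTART	(-3)
#define K_EPASSTHROUGH	(-4)

typedef int (*ioctl_fn)(unsigned long cmd);

/* Hypothetical layer that handles only command 1. */
static int
layer_a_ioctl(unsigned long cmd)
{
	return cmd == 1 ? 0 : K_EPASSTHROUGH;
}

/* Hypothetical layer that recognises command 2 but rejects it. */
static int
layer_b_ioctl(unsigned long cmd)
{
	return cmd == 2 ? EINVAL : K_EPASSTHROUGH;
}

/*
 * Offer the command to each layer in turn. A real error (or success)
 * stops the walk; K_EPASSTHROUGH means "try the next layer". If nobody
 * claims the command, the caller sees ENOTTY.
 */
static int
dispatch_ioctl(unsigned long cmd)
{
	static const ioctl_fn layers[] = { layer_a_ioctl, layer_b_ioctl };

	for (size_t i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
		int error = layers[i](cmd);
		if (error != K_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;
}
```

Because "unhandled" no longer shares the value -1 with the restart
indication, a layer's answer cannot be mistaken for a request to restart
the system call.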


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
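
The 95% rule can be sketched as a validation step applied whenever one of
the three minimums is changed. This is an assumption-laden illustration:
the array layout and set_min() are hypothetical, and only the tunable names
and the 95% cap are taken from the message above.

```c
#include <stdbool.h>

/* Indices for the three tunables: vm.anonmin, vm.vnodemin, vm.vtextmin. */
enum { MIN_ANON, MIN_VNODE, MIN_VTEXT, MIN_COUNT };

/* The minimums are percentages; their sum may not exceed this cap. */
#define MIN_SUM_CAP	95

/*
 * Try to set one minimum. The new value is accepted only if the sum of
 * all three minimums stays at or below the cap, so that some memory is
 * always left that the pagedaemon may reuse.
 */
static bool
set_min(int mins[MIN_COUNT], int which, int newval)
{
	int sum = newval;

	if (which < 0 || which >= MIN_COUNT || newval < 0)
		return false;
	for (int i = 0; i < MIN_COUNT; i++) {
		if (i != which)
			sum += mins[i];
	}
	if (sum > MIN_SUM_CAP)
		return false;
	mins[which] = newval;
	return true;
}
```

A rejected update leaves the previous value in place in this model.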


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.217 30-Jun-2021 christos

PR/56286: Martin Husemann: Fix NULL deref on kmod load.
- No need to set ret_domove and ret_fd in the regular case, they are meaningless
- KASSERT instead of setting errno and then doing the NULL deref.


# 1.216 29-Jun-2021 dholland

Add containment for the cloning devices hack in vn_open.

Cloning devices (and also things like /dev/stderr) work by allocating
a struct file, stuffing it in the file table (which is a layer
violation), stuffing the file descriptor number for it in a magic
field of struct lwp (which is gross), and then "failing" with one of
two magic errnos, EDUPFD or EMOVEFD.

Before this commit, all callers of vn_open in the kernel (there are
quite a few) were expected to check for these errors and handle the
situation. Needless to say, none of them except for open() itself did,
resulting in internal negative errnos being returned to userspace.

This hack is fairly deeply rooted and cannot be eliminated all at
once. This commit adds logic to handle the magic errnos inside
vn_open; now on success vn_open returns either a vnode or an integer
file descriptor, along with a flag that says whether the underlying
code requested EDUPFD or EMOVEFD. Callers not prepared to cope with
file descriptors can pass NULL for the extra return values, in which
case if a file descriptor would be produced vn_open fails with
EOPNOTSUPP.

Since I'm rearranging vn_open's signature anyway, stop exposing struct
nameidata. Instead, take three arguments: an optional vnode to use as
the starting point (like openat()), the path, and additional namei
flags to use, restricted to NOCHROOT and TRYEMULROOT. (Other namei
behavior, e.g. NOFOLLOW, can be requested via the open flags.)

This change requires a kernel bump. Ride the one an hour ago.
(That was supposed to be coordinated; did not intend to let an hour
slip by. My fault.)


Revision tags: thorpej-i2c-spi-conf-base
# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

branches: 1.214.4;
Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

branches: 1.200.4;
Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
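
The lookup order described above (emulation root first for absolute
pathnames, then the normal root) can be sketched in userland C. This is
only an illustration under stated assumptions: lookup_sketch(), exists_fn
and demo_exists() are hypothetical names, paths are plain strings rather
than vnodes, and the ".." and symlink rules are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical existence check standing in for a real lookup. */
typedef bool (*exists_fn)(const char *path);

/* Tiny fixed "file system" used only to exercise the sketch. */
bool
demo_exists(const char *path)
{
	static const char *const files[] = {
		"/emul/linux/lib/libc.so",
		"/etc/passwd",
	};

	for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++)
		if (strcmp(files[i], path) == 0)
			return true;
	return false;
}

/*
 * Resolve `path` into `out`.  If an emulation root is set, the
 * try-emulation-root flag is given and the path is absolute, look
 * inside the emulation root first and retry from the normal root if
 * that fails.  Returns 0 on success, -1 if nothing was found or the
 * buffer is too small.
 */
int
lookup_sketch(const char *emulroot, bool tryemulroot, const char *path,
    exists_fn exists, char *out, size_t outlen)
{
	int n;

	assert(path != NULL && out != NULL && exists != NULL);

	if (emulroot != NULL && tryemulroot && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n >= 0 && (size_t)n < outlen && exists(out))
			return 0;	/* found inside the emulation root */
	}

	/* Retry from the normal root (or do a plain relative lookup). */
	n = snprintf(out, outlen, "%s", path);
	if (n < 0 || (size_t)n >= outlen || !exists(out))
		return -1;
	return 0;
}
```

With the demo table, "/lib/libc.so" resolves inside "/emul/linux" while
"/etc/passwd" falls back to the normal root.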


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:

  - Level 0, learning mode: Warnings only about anything that might've
    resulted in 'access denied' or similar in a higher strict level.

  - Level 1, IDS mode:
    - Deny access on fingerprint mismatch.
    - Deny modification of veriexec tables.

  - Level 2, IPS mode:
    - All implications of strict level 1.
    - Deny write access to monitored files.
    - Prevent removal of monitored files.
    - Enforce access type - 'direct', 'indirect', or 'file'.

  - Level 3, lockdown mode:
    - All implications of strict level 2.
    - Prevent creation of new files.
    - Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
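
The four strict levels listed above can be read as a cumulative policy.
The following is a rough sketch only, not the veriexec implementation:
the enum and function names are made up for illustration, the access
type check ('direct', 'indirect', 'file') is left out, and level 0 is
modelled as "allow" since it only warns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names are illustrative; the commit gives only numbers and mode names. */
enum strict_level {
	STRICT_LEARNING = 0,
	STRICT_IDS      = 1,
	STRICT_IPS      = 2,
	STRICT_LOCKDOWN = 3
};

struct file_state {
	bool monitored;		/* file has a veriexec entry */
	bool fp_matches;	/* fingerprint evaluation succeeded */
};

/* Open for reading or execution. */
bool
may_access(enum strict_level lvl, const struct file_state *f)
{
	assert(f != NULL);
	if (!f->monitored)
		return lvl < STRICT_LOCKDOWN;	/* level 3 denies non-monitored files */
	if (!f->fp_matches)
		return lvl < STRICT_IDS;	/* level 0 only warns on mismatch */
	return true;
}

/* Open for writing, or removal, of an existing file. */
bool
may_modify(enum strict_level lvl, const struct file_state *f)
{
	assert(f != NULL);
	if (f->monitored)
		return lvl < STRICT_IPS;	/* level 2 and up protect monitored files */
	return may_access(lvl, f);
}

/* Creating a new file. */
bool
may_create(enum strict_level lvl)
{
	return lvl < STRICT_LOCKDOWN;		/* level 3 prevents new files */
}
```

Each higher level keeps the restrictions of the one below it, which is
why the checks are written as comparisons rather than equality tests.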


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
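
As a small illustration of the two quantities described above, here is a
toy model of a send queue. The struct and function names are
hypothetical and this is not the socket buffer code; it only shows that
FIONWRITE corresponds to the bytes queued and FIONSPACE to the bytes
still free, so a caller of FIONSPACE never needs the queue size itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a send queue; field names are illustrative. */
struct sendq {
	size_t sq_size;		/* capacity of the queue in bytes */
	size_t sq_used;		/* bytes currently queued for sending */
};

/* What FIONWRITE is described as reporting: bytes in the send queue. */
size_t
sendq_nwrite(const struct sendq *sq)
{
	return sq->sq_used;
}

/* What FIONSPACE is described as reporting: free bytes in the send queue. */
size_t
sendq_nspace(const struct sendq *sq)
{
	assert(sq->sq_used <= sq->sq_size);
	return sq->sq_size - sq->sq_used;
}
```

As the commit message notes, a non-zero free-space value does not
guarantee that a write of that size will not block.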


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
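
A minimal sketch of the convention described above, written as userland
C. The values -3 and -4 come from the commit message; the handler names
and command numbers are hypothetical, and the final translation to
ENOTTY at the system call boundary is an assumption of this sketch
rather than a quote of the kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal values as stated in the commit message. */
#define SK_ERESTART	(-3)	/* restart the system call */
#define SK_EPASSTHROUGH	(-4)	/* command not handled at this layer */

/* Hypothetical command numbers, for illustration only. */
enum { CMD_GENERIC = 1, CMD_DRIVER = 2, CMD_UNKNOWN = 99 };

/* Lower layer: handles CMD_DRIVER, passes everything else through. */
static int
driver_ioctl(int cmd)
{
	if (cmd == CMD_DRIVER)
		return 0;
	return SK_EPASSTHROUGH;	/* "not mine", distinct from a real error */
}

/* Upper layer: handles CMD_GENERIC itself, otherwise asks the driver. */
static int
generic_ioctl(int cmd)
{
	if (cmd == CMD_GENERIC)
		return 0;
	return driver_ioctl(cmd);
}

/* System call boundary: an unclaimed command becomes ENOTTY. */
int
sys_ioctl_sketch(int cmd)
{
	int error = generic_ioctl(cmd);

	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	assert(error != -1);	/* no layer should return -1 any more */
	return error;
}
```

Because "unhandled" now has its own value, it can no longer be mistaken
for a restart request by any layer in between.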


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
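
The two rules in this message (a per-type minimum that the pagedaemon
must respect, and a 95% cap on the sum of the minimums) amount to simple
integer arithmetic. The sketch below is a toy model with made-up names,
not the UVM code; the real counters and tunables are not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* The three usage types named in the commit message. */
enum pguse { PGU_ANON, PGU_EXEC, PGU_FILE, PGU_NTYPES };

struct balance {
	unsigned min_pct[PGU_NTYPES];		/* reserved share, in percent */
	unsigned long pages[PGU_NTYPES];	/* pages in use, per type */
	unsigned long pageable;			/* total pageable pages */
};

/* Reject a new minimum if the minimums would then sum to more than 95%. */
bool
balance_set_min(struct balance *b, enum pguse t, unsigned pct)
{
	unsigned sum = pct;

	for (int i = 0; i < PGU_NTYPES; i++)
		if (i != (int)t)
			sum += b->min_pct[i];
	if (sum > 95)
		return false;
	b->min_pct[t] = pct;
	return true;
}

/* May one page of type `t` be reused without dropping below its minimum? */
bool
balance_can_reuse(const struct balance *b, enum pguse t)
{
	assert(b->pages[t] > 0);
	/* (pages - 1) >= pageable * min_pct / 100, kept in integers */
	return (b->pages[t] - 1) * 100 >= b->pageable * b->min_pct[t];
}
```

For example, with 1000 pageable pages and a 10% minimum for anonymous
memory, a page may be taken from that type only while it holds more than
100 pages.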


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
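
To illustrate why a read routine would take both an offset pointer and
an "update the offset" flag, here is a small in-memory model. All names
(struct memfile, memfile_read, SK_UPDATE_OFFSET) are hypothetical and
the flag value is arbitrary; it is a sketch of the calling convention,
not the fileops code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define SK_UPDATE_OFFSET 0x01	/* hypothetical flag: write the new offset back */

/* Toy "open file": a byte array plus a current offset. */
struct memfile {
	const char *data;
	size_t size;
	off_t offset;
};

/*
 * Read up to `len` bytes starting at *offp.  The offset passed by
 * reference is advanced only when SK_UPDATE_OFFSET is set.
 */
size_t
memfile_read(struct memfile *mf, off_t *offp, void *buf, size_t len, int flags)
{
	off_t off = *offp;
	size_t n;

	assert(off >= 0);
	if ((size_t)off >= mf->size)
		return 0;		/* at or past end of file */
	n = mf->size - (size_t)off;
	if (n > len)
		n = len;
	memcpy(buf, mf->data + off, n);
	if (flags & SK_UPDATE_OFFSET)
		*offp = off + (off_t)n;
	return n;
}
```

A sequential read would pass the file's own offset together with the
flag, while a positioned read would pass a caller-supplied offset
without it and leave the file offset untouched.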


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.215 16-Jun-2021 dholland

Add a new namei flag NONEXCLHACK for open with O_CREAT and not O_EXCL.

This case needs to be distinguished from the other CREATE operations
because it is supposed to successfully return (and open) the target if
it exists. In the case where that target is the root, or a mount
point, such that there's no parent dir, "real" CREATE operations fail,
but O_CREAT without O_EXCL needs to succeed.

So (a) add the flag, (b) test for it in namei in the situation
described above, (c) set it in open under the appropriate
circumstances, and (d) because this can result in namei returning
ni_dvp of NULL, cope with that case.

Should get into -9 and maybe even -8, because it was prompted by
issues with 3rd-party code. The use of a flag (vs. adding an
additional nameiop, which would be more appropriate) was deliberate to
make the patch small and noninvasive.
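
The user-visible distinction that 1.215 relies on can be sketched from userland with plain POSIX calls. This does not exercise the root or mount-point case the commit fixes, and the helper names make_existing_file and try_open are hypothetical; it only shows that O_CREAT alone opens an existing target while O_CREAT|O_EXCL refuses it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Create an empty file from a mkstemp(3) template; returns 0 on success. */
static int
make_existing_file(char *tmpl)
{
	int fd = mkstemp(tmpl);

	if (fd == -1)
		return -1;
	return close(fd);
}

/* Try to open path with the given flags; returns 0 on success, else errno. */
static int
try_open(const char *path, int flags)
{
	int fd;

	assert(path != NULL);
	fd = open(path, flags, 0600);
	if (fd == -1)
		return errno;
	(void)close(fd);
	return 0;
}
```

For a file that already exists, try_open(path, O_RDWR | O_CREAT) is expected to return 0, and try_open(path, O_RDWR | O_CREAT | O_EXCL) is expected to return EEXIST.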


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.214 09-Nov-2020 chs

Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


# 1.213 11-Jun-2020 ad

branches: 1.213.2;
Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
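
The truncation that motivated 1.198 can be reproduced in a few lines. This is a sketch assuming unsigned int is narrower than 64 bits (true on the common ILP32 and LP64 ABIs); uimin is written out here with the same shape as the libkern helper, while u64min and clamp_via_uimin are hypothetical names used only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* The sketch only makes sense if unsigned int is narrower than 64 bits. */
static_assert(sizeof(unsigned int) < sizeof(uint64_t),
    "unsigned int must be narrower than uint64_t for this demo");

/* Minimum of two unsigned ints, as the name now honestly says. */
static inline unsigned int
uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical width-preserving variant, for comparison only. */
static inline uint64_t
u64min(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * What a caller gets when 64-bit values go through the unsigned int
 * helper.  The casts are spelled out here; with a generically named
 * min() the same conversion happens implicitly and silently.
 */
static uint64_t
clamp_via_uimin(uint64_t len, uint64_t limit)
{
	return uimin((unsigned int)len, (unsigned int)limit);
}
```

With a 32-bit unsigned int, passing 2^32 + 5 and 2^32 + 100 through clamp_via_uimin drops the high bits of both arguments before comparing, whereas u64min returns the true minimum.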


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
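
For comparison, the older userland idiom that 1.196 mentions can be sketched as follows. The function name open_regular_fallback and the choice of EINVAL for the non-regular case are assumptions made for this sketch and are not the kernel's behaviour; note that the open itself still happens before the type check, which is exactly the side-effect problem the commit describes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path, then reject anything that is not a regular file.
 * flags must not contain O_CREAT (no mode argument is passed).
 * Returns a descriptor, or -1 with errno set.
 */
static int
open_regular_fallback(const char *path, int flags)
{
	struct stat st;
	int fd, error;

	assert(path != NULL);
	/* O_NONBLOCK keeps the open from blocking on FIFOs and some devices. */
	fd = open(path, flags | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		error = errno;
		(void)close(fd);
		errno = error;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		(void)close(fd);
		errno = EINVAL;	/* arbitrary choice for this sketch */
		return -1;
	}
	return fd;
}
```

A single flag checked inside open avoids the window in which a device has already been opened before the caller learns it was not a regular file.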


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
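
The ordering stated in 1.194 is the usual hand-over-hand pattern. A minimal sketch with POSIX mutexes follows, assuming a hypothetical demo_node type in place of vnodes and pthread_mutex_t in place of the vnode lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical tree node; stands in for a vnode with its lock. */
struct demo_node {
	pthread_mutex_t   dn_lock;
	struct demo_node *dn_child;
};

/*
 * Caller holds parent->dn_lock.  Lock the child first, and only then
 * release the parent, so the child cannot change or vanish between
 * the two steps.  Returns the child locked (or NULL), parent unlocked.
 */
static struct demo_node *
step_down(struct demo_node *parent)
{
	struct demo_node *child;

	assert(parent != NULL);
	child = parent->dn_child;
	if (child != NULL)
		pthread_mutex_lock(&child->dn_lock);
	pthread_mutex_unlock(&parent->dn_lock);
	return child;
}
```

Releasing the parent before taking the child lock would leave a window with no lock held, and taking the two locks in the opposite order elsewhere would risk deadlock.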


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
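
The leak fixed in 1.179 comes from padding bytes that no member assignment ever writes. A small sketch, assuming a hypothetical two-field structure rather than the real struct stat, shows the zero-then-fill pattern and a helper that inspects the bytes between the fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure: most ABIs pad between the two members. */
struct demo_stat {
	uint8_t  ds_mode;
	uint64_t ds_dev;
};

/* Zero the whole object first, then fill in the members. */
static void
fill_demo_stat(struct demo_stat *sb, uint8_t mode, uint64_t dev)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->ds_mode = mode;
	sb->ds_dev = dev;
}

/* Returns 1 if every byte between ds_mode and ds_dev is zero. */
static int
demo_stat_padding_clean(const struct demo_stat *sb)
{
	unsigned char raw[sizeof(*sb)];
	size_t i;

	memcpy(raw, sb, sizeof(raw));
	for (i = sizeof(sb->ds_mode); i < offsetof(struct demo_stat, ds_dev); i++)
		if (raw[i] != 0)
			return 0;
	return 1;
}
```

Without the memset, whatever was previously in that memory would be copied out along with the members when the whole structure is handed to the caller.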


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
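
The four strict levels above are cumulative: each level keeps every restriction of the one below it. A minimal sketch of that ordering follows. All names here (vx_policy, vx_policy_for, the VX_* constants) are hypothetical and are not the identifiers used in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels in the commit message. */
#define VX_LEARNING	0	/* warnings only */
#define VX_IDS		1	/* deny on fingerprint mismatch */
#define VX_IPS		2	/* also protect monitored files */
#define VX_LOCKDOWN	3	/* also deny anything not monitored */

/* What a given strict level denies (illustrative, not the kernel's layout). */
struct vx_policy {
	bool deny_mismatch;		/* fingerprint mismatch denies access */
	bool deny_table_change;		/* veriexec tables are read-only */
	bool deny_modify_monitored;	/* no write/removal of monitored files */
	bool deny_unmonitored;		/* no new files, no unmonitored access */
};

struct vx_policy
vx_policy_for(int level)
{
	struct vx_policy p;

	assert(level >= VX_LEARNING && level <= VX_LOCKDOWN);

	/* Each level includes all implications of the levels below it. */
	p.deny_mismatch = level >= VX_IDS;
	p.deny_table_change = level >= VX_IDS;
	p.deny_modify_monitored = level >= VX_IPS;
	p.deny_unmonitored = level >= VX_LOCKDOWN;
	return p;
}
```

Expressing the policy as threshold comparisons keeps the "all implications of the previous level" rule in one place instead of repeating it per level.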


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
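
From userland, the two ioctls described above might be queried as in the sketch below. It assumes both take a pointer to int, as FIONREAD does. The helper name send_queue_status is hypothetical, and the #if guard is only there so the sketch still compiles on systems that lack these ioctls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Report how many bytes are waiting in the send queue of fd (*queued)
 * and how many bytes of room remain (*space).
 * Returns 0 on success, -1 with errno set on failure.
 */
int
send_queue_status(int fd, int *queued, int *space)
{
	assert(queued != NULL && space != NULL);
#if defined(FIONWRITE) && defined(FIONSPACE)
	if (ioctl(fd, FIONWRITE, queued) == -1)
		return -1;
	if (ioctl(fd, FIONSPACE, space) == -1)
		return -1;
	return 0;
#else
	/* This platform does not provide the ioctls. */
	(void)fd;
	errno = ENOTSUP;
	return -1;
#endif
}
```

As the commit message notes, a non-zero *space does not guarantee that a write of that size will not block.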


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
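
The convention above can be sketched as a two-layer dispatch: a lower layer returns a dedicated "not mine" code, and the top level turns that into the error userland sees. This is an illustration only. The x_/X_ names and the sample command are hypothetical, the numeric values are the ones quoted in the commit message, and the mapping to ENOTTY at the top level is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal codes, using the values given in the commit message. */
#define X_ERESTART	(-3)
#define X_EPASSTHROUGH	(-4)

#define X_CMD_GETVAL	1	/* hypothetical command the driver handles */

/*
 * Lower layer: returns 0, a positive errno, or X_EPASSTHROUGH when it
 * does not recognise the command.
 */
int
x_driver_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case X_CMD_GETVAL:
		*data = 42;
		return 0;
	default:
		return X_EPASSTHROUGH;
	}
}

/* Top level: nobody claimed the command, so report ENOTTY to the caller. */
int
x_sys_ioctl(int cmd, int *data)
{
	int error = x_driver_ioctl(cmd, data);

	/* -1 is ambiguous with the old ERESTART; handlers must not use it. */
	assert(error != -1);
	if (error == X_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Keeping the pass-through code distinct from both -1 and ENOTTY lets an intermediate layer tell "unhandled here, try the next layer" apart from a real error.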


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
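
The two rules above (minimums summing to at most 95%, and no reuse of a page that would push its type below its minimum) can be sketched as follows. The function names and the exact arithmetic are assumptions for illustration, not the UVM code.

```c
#include <assert.h>
#include <stdbool.h>

#define X_MIN_SUM_LIMIT	95	/* percent; leaves some memory reusable */

/* Would this combination of the three minimum percentages be accepted? */
bool
x_minimums_ok(int anonmin, int vnodemin, int vtextmin)
{
	assert(anonmin >= 0 && vnodemin >= 0 && vtextmin >= 0);
	return anonmin + vnodemin + vtextmin <= X_MIN_SUM_LIMIT;
}

/*
 * May the pagedaemon reuse one page of a given type?  Taking it would
 * leave (inuse - 1) pages, which must not fall below min_pct percent
 * of the pageable total.
 */
bool
x_may_reclaim(long inuse, long pageable, int min_pct)
{
	return (inuse - 1) * 100 >= pageable * min_pct;
}
```

With 1000 pageable pages and a 10% minimum, a type holding 101 pages may give one up, while a type holding exactly 100 may not.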


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
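
The effect of the offset pointer plus flag can be sketched in userland terms: a read(2)-style caller passes the file's own offset and asks for it to be updated, while a pread(2)-style caller passes a private offset and leaves the file's position alone. Everything below (struct x_file, x_read, X_FOF_UPDATE_OFFSET) is a hypothetical model, not the fileops interface itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define X_FOF_UPDATE_OFFSET	0x01	/* hypothetical: advance *offset */

/* A toy in-memory "file" with its own current position. */
struct x_file {
	const char *data;
	size_t size;
	off_t offset;
};

/*
 * Read up to len bytes starting at *offset.  The offset passed by
 * reference is advanced only when X_FOF_UPDATE_OFFSET is set.
 */
ssize_t
x_read(struct x_file *fp, off_t *offset, char *buf, size_t len, int flags)
{
	size_t avail;

	assert(fp != NULL && offset != NULL && *offset >= 0);
	if ((size_t)*offset >= fp->size)
		return 0;		/* at or past end of file */
	avail = fp->size - (size_t)*offset;
	if (len > avail)
		len = avail;
	memcpy(buf, fp->data + *offset, len);
	if (flags & X_FOF_UPDATE_OFFSET)
		*offset += (off_t)len;
	return (ssize_t)len;
}
```

A positional read would call x_read(&f, &pos, buf, n, 0) with a local pos, and a sequential read would call x_read(&f, &f.offset, buf, n, X_FOF_UPDATE_OFFSET).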


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.214 09-Nov-2020 chs

Lock the vnode while calling VOP_BMAP() for FIOGETBMAP.

Reported-by: syzbot+cfa1b773be7337250428@syzkaller.appspotmail.com


Revision tags: thorpej-futex-base
# 1.213 11-Jun-2020 ad

Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
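
The truncation hazard described above can be sketched in userland C, assuming unsigned int is narrower than 64 bits. The names uimin_sketch, MIN_SKETCH, clamp_truncating and clamp_generic are hypothetical stand-ins, not the libkern definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a min() that is defined on unsigned int. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: may evaluate arguments twice, but does not truncate. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/* Both 64-bit arguments are silently converted to unsigned int. */
static uint64_t
clamp_truncating(uint64_t len, uint64_t limit)
{
	assert(sizeof(unsigned int) < sizeof(uint64_t));
	return uimin_sketch(len, limit);
}

/* The comparison is done at full 64-bit width. */
static uint64_t
clamp_generic(uint64_t len, uint64_t limit)
{
	return MIN_SKETCH(len, limit);
}
```

With 32-bit unsigned int, clamp_truncating(0x100000001, 0x200000000) compares the truncated values 1 and 0, while clamp_generic compares the full values. That difference is what the rename to uimin/uimax makes visible at the call site.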


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
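
A hedged sketch of the two approaches compared above. open_regular() is a hypothetical helper: it uses O_REGULAR where the headers define it and otherwise falls back to the open(O_NONBLOCK) plus fstat()/S_ISREG() sequence. The errno value set in the fallback path is an arbitrary choice for illustration, not the value the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open path read-only, succeeding only for regular files. */
static int
open_regular(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The kernel rejects non-regular files before any device open. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	struct stat st;
	int fd, saved;

	/* Fallback: the open itself may already trigger device side effects. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* arbitrary choice for this sketch */
		return -1;
	}
	return fd;
#endif
}
```

The fallback shows why the flag is preferable: by the time fstat() can reject the file, the device has already been opened.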


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
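
The leak described above comes from padding bytes that member assignments never touch. A minimal userland sketch, assuming a hypothetical structure record_sketch with interior and tail padding; this is not the kernel's struct stat code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; most ABIs pad after kind and after mode. */
struct record_sketch {
	uint8_t  kind;
	uint64_t dev;
	uint16_t mode;
};

/* Zero the whole object first so padding carries no stale bytes. */
static void
fill_record(struct record_sketch *out, uint8_t kind, uint64_t dev,
    uint16_t mode)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	out->kind = kind;
	out->dev = dev;
	out->mode = mode;
}

/* Count bytes in buf that still hold the given marker value. */
static size_t
count_bytes_equal(const void *buf, size_t len, unsigned char value)
{
	const unsigned char *p = buf;
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (p[i] == value)
			n++;
	return n;
}
```

Pre-filling a record with a marker byte such as 0xAA, calling fill_record(), and then counting marker bytes is one way to check that nothing stale would be copied out. Without the memset(), the padding would typically keep the marker.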


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
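
A small userland sketch of the behaviour this fix describes, under the assumption that the path is writable scratch space. offset_after_empty_append() is a hypothetical helper that reports where the file offset sits after a zero-length write on an O_APPEND descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create path with four bytes of data, rewind to offset 0, issue a
 * zero-length write, and return the resulting offset (or -1 on error).
 */
static off_t
offset_after_empty_append(const char *path)
{
	off_t off;
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;
	if (write(fd, "data", 4) != 4 ||
	    lseek(fd, 0, SEEK_SET) == -1 ||
	    write(fd, "", 0) != 0) {
		close(fd);
		return -1;
	}
	off = lseek(fd, 0, SEEK_CUR);
	close(fd);
	return off;
}
```

If the empty write left the offset alone, the helper returns 0. A result of 4 would mean the offset was moved to end-of-file even though nothing was written.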


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
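
The strict levels listed above are cumulative: each level keeps every restriction of the level below it. One way to read that is that each policy check reduces to a comparison against a threshold level. The following is only a sketch of that reading; the enum and function names are hypothetical and are not taken from the veriexec sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels described in the commit. */
enum sketch_strict_level {
	SKETCH_LEARNING = 0,	/* warnings only */
	SKETCH_IDS = 1,		/* deny on fingerprint mismatch */
	SKETCH_IPS = 2,		/* also protect monitored files */
	SKETCH_LOCKDOWN = 3	/* also deny new and non-monitored files */
};

/* Level 1 and above: deny access on fingerprint mismatch. */
static bool
sketch_deny_on_mismatch(enum sketch_strict_level level)
{
	return level >= SKETCH_IDS;
}

/* Level 2 and above: deny write access to monitored files. */
static bool
sketch_deny_monitored_write(enum sketch_strict_level level)
{
	return level >= SKETCH_IPS;
}

/* Level 3 only: prevent creation of new files. */
static bool
sketch_deny_file_creation(enum sketch_strict_level level)
{
	return level >= SKETCH_LOCKDOWN;
}
```

Because the levels are ordered, "all implications of strict level N" falls out of the `>=` comparison without repeating the lower-level checks.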


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
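
The convention described above can be sketched as follows: a handler returns EPASSTHROUGH for a command it does not recognize, and the top-level caller turns that into the user-visible error. This is a user-space illustration only. The numeric values are the ones stated in the commit message; the `sketch_` names, the command number, and the final mapping to ENOTTY at the top level are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>

/* Values as given in the commit message; prefixed to avoid header clashes. */
#define SKETCH_ERESTART		(-3)
#define SKETCH_EPASSTHROUGH	(-4)

_Static_assert(SKETCH_ERESTART != SKETCH_EPASSTHROUGH,
    "unhandled-command code must not collide with restart code");

#define SKETCH_CMD_GETVALUE	1	/* hypothetical command number */

/* Hypothetical lower-level handler: knows exactly one command. */
static int
sketch_driver_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case SKETCH_CMD_GETVALUE:
		*data = 42;
		return 0;
	default:
		/* Not ours: say so without using -1 or ENOTTY. */
		return SKETCH_EPASSTHROUGH;
	}
}

/* Hypothetical top level: maps "nobody handled it" to ENOTTY (assumed). */
static int
sketch_sys_ioctl(int cmd, int *data)
{
	int error = sketch_driver_ioctl(cmd, data);

	if (error == SKETCH_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Keeping the "unhandled" code distinct from ERESTART is what lets a caller tell "restart the system call" apart from "no handler recognized this command".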


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.213 11-Jun-2020 ad

Counter tweaks:

- Don't need to count anonpages+filepages any more; clean+unknown+dirty for
each kind of page can be summed to get the totals.

- Track the number of free pages with a counter so that it's one less thing
for the allocator to do, which opens up further options there.

- Remove cpu_count_sync_one(). It has no users and doesn't save a whole lot.
For the cheap option, give cpu_count_sync() a boolean parameter indicating
that a cached value is okay, and rate limit the updates for cached values
to hz.


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
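
The accessor pattern described above can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions: the struct and function names are hypothetical, `atomic_load_explicit(..., memory_order_relaxed)` stands in for the kernel's atomic_load_relaxed(), and the return type is a guess rather than the kernel's actual signature.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the vnode field mentioned in the commit. */
struct sketch_vnode {
	atomic_uint v_usecount;
};

/*
 * Accessor that hides how the count is read. Callers get a snapshot;
 * a relaxed load gives no ordering guarantees relative to other memory.
 */
static unsigned int
sketch_vrefcnt(struct sketch_vnode *vp)
{
	return atomic_load_explicit(&vp->v_usecount, memory_order_relaxed);
}
```

Routing reads through one function means the representation of the count can change later without touching every caller.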


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
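
The truncation hazard described above can be reproduced in a few lines of user-space C. This is a sketch, not the libkern code: `sketch_uimin` is a hypothetical stand-in for a function-style min on unsigned int, `SKETCH_MIN` mirrors the macro form quoted in the message, and the example assumes a platform where unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a function-style min defined on unsigned int. */
static unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: no truncation, but each argument may be evaluated twice. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamp a 64-bit length through the unsigned int helper. */
static uint64_t
sketch_clamp_narrowing(uint64_t len, uint64_t limit)
{
	/* Both arguments are silently converted to unsigned int here. */
	return sketch_uimin(len, limit);
}

/* Clamp a 64-bit length through the macro; the comparison stays 64-bit. */
static uint64_t
sketch_clamp_macro(uint64_t len, uint64_t limit)
{
	return SKETCH_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 16, the narrowing helper returns 1 (the high bits of len are dropped before the comparison), while the macro returns 16. A name such as uimin makes the operand type visible at the call site.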


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
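A rough sketch of the two approaches compared above. open_regular() is a hypothetical wrapper, not an existing interface; it uses O_REGULAR when the headers define it and otherwise falls back to the open-then-fstat pattern, where the open may already have triggered device side effects.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path read-only, but only if it names a regular file.
 * Returns a descriptor, or -1 on error or if the file is not regular.
 */
static int
open_regular(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The kernel rejects non-regular files as part of the open itself. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	/* Fallback: the type is only checked after the open has happened. */
	struct stat st;
	int fd = open(path, O_RDONLY | O_NONBLOCK);

	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode)) {
		close(fd);
		return -1;
	}
	return fd;
#endif
}
```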


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.
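The idiom behind this fix, sketched in userland under stated assumptions: struct rec is a hypothetical record with a padding hole (it stands in for the stat structure), and rec_fill() and rec_padding_is_clean() are names invented for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: a 1-byte member followed by an 8-byte aligned one. */
struct rec {
	uint8_t  kind;	/* padding follows on common ABIs */
	uint64_t dev;
};

/* Clear every byte first, padding included, then assign the members. */
static void
rec_fill(struct rec *out, uint8_t kind, uint64_t dev)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	out->kind = kind;
	out->dev = dev;
}

/* Returns 1 if every byte between the two members is zero. */
static int
rec_padding_is_clean(const struct rec *r)
{
	const unsigned char *p = (const unsigned char *)r;
	size_t i;

	for (i = offsetof(struct rec, kind) + sizeof(r->kind);
	    i < offsetof(struct rec, dev); i++) {
		if (p[i] != 0)
			return 0;
	}
	return 1;
}
```

Without the memset, whatever was previously in the buffer would remain in the padding bytes and be copied out along with the real members.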

from pgoyette


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to a different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
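One way to observe the behaviour from userland, as a sketch only: offset_after_empty_append() is a hypothetical helper that writes three bytes to a scratch file opened with O_APPEND, seeks back to 0, issues a zero-length write, and reports where the offset ended up. The caller supplies the scratch path, and the file is removed afterwards.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns the file offset seen after a zero-length write on an O_APPEND
 * descriptor that was positioned at 0, or -1 if any step fails.
 */
static off_t
offset_after_empty_append(const char *path)
{
	off_t off = -1;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;

	if (write(fd, "abc", 3) == 3 &&		/* file is now 3 bytes long */
	    lseek(fd, 0, SEEK_SET) == 0 &&	/* move away from end of file */
	    write(fd, "", 0) == 0)		/* the zero-length write */
		off = lseek(fd, 0, SEEK_CUR);

	close(fd);
	unlink(path);
	return off;
}
```

If the zero-length write leaves the offset alone, the helper returns 0; a return of 3 would mean the write moved the offset to end of file.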


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
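The lookup order described above (emulation root first for absolute paths, then the normal root) can be modelled in userland roughly as follows. This is not namei(): emul_resolve() is a hypothetical helper that only checks for existence with access(2), and it ignores the ".." and symlink rules.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try an absolute path under emul_root first, then from the real root.
 * Relative paths are used as given. On success the chosen path is copied
 * to out and 0 is returned; -1 means neither candidate exists or the
 * buffer is too small.
 */
static int
emul_resolve(const char *emul_root, const char *path, char *out, size_t outlen)
{
	assert(emul_root != NULL && path != NULL && out != NULL);

	if (path[0] == '/') {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);

		if (n >= 0 && (size_t)n < outlen && access(out, F_OK) == 0)
			return 0;	/* found inside the emulation root */
	}

	/* Retry from the normal root. */
	if (strlen(path) >= outlen)
		return -1;
	strcpy(out, path);
	return access(out, F_OK) == 0 ? 0 : -1;
}
```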


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
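Because each strict level includes everything from the level below it, the policy above reduces to threshold comparisons. A small sketch follows; the enum and predicate names are illustrative only and are not the identifiers used in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Strict levels as listed in the log message; names are hypothetical. */
enum strict_level {
	LEVEL_LEARNING = 0,	/* warnings only */
	LEVEL_IDS      = 1,	/* deny on fingerprint mismatch */
	LEVEL_IPS      = 2,	/* also protect monitored files */
	LEVEL_LOCKDOWN = 3	/* also deny non-monitored files */
};

/* Level 1 and above: access is denied when the fingerprint mismatches. */
static bool
denies_fingerprint_mismatch(enum strict_level l)
{
	assert(l <= LEVEL_LOCKDOWN);
	return l >= LEVEL_IDS;
}

/* Level 2 and above: monitored files cannot be opened for writing. */
static bool
denies_write_to_monitored(enum strict_level l)
{
	assert(l <= LEVEL_LOCKDOWN);
	return l >= LEVEL_IPS;
}

/* Level 3 only: files without a table entry cannot be accessed. */
static bool
denies_unmonitored_access(enum strict_level l)
{
	assert(l <= LEVEL_LOCKDOWN);
	return l >= LEVEL_LOCKDOWN;
}
```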


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework;
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
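
The pass-through convention described above can be sketched in user space as
follows. This is a minimal model, not the kernel's code: the two numeric values
come from the message above, while the MODEL_ prefixes, the layer functions and
model_ioctl() are hypothetical names invented for illustration. Mapping an
unclaimed command to ENOTTY at the top level is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Values as stated in the commit message; names are illustrative. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

typedef int (*model_ioctl_fn)(unsigned long cmd, void *data);

/* Hypothetical layer that only understands command 1. */
static int
model_layer_a(unsigned long cmd, void *data)
{
	if (cmd == 1) {
		*(int *)data = 100;
		return 0;
	}
	return MODEL_EPASSTHROUGH;	/* not ours, let the next layer try */
}

/* Hypothetical layer that only understands command 2. */
static int
model_layer_b(unsigned long cmd, void *data)
{
	if (cmd == 2) {
		*(int *)data = 200;
		return 0;
	}
	return MODEL_EPASSTHROUGH;
}

/*
 * Try each layer in turn. Any result other than MODEL_EPASSTHROUGH
 * (success, a real error, or MODEL_ERESTART) ends the search, so a
 * restart request can no longer be mistaken for "unhandled command".
 */
static int
model_ioctl(unsigned long cmd, void *data)
{
	static const model_ioctl_fn layers[] = { model_layer_a, model_layer_b };

	for (size_t i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
		int error = layers[i](cmd, data);
		if (error != MODEL_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;	/* nobody claimed the command */
}
```

Because "unhandled" has its own code, a layer can return MODEL_ERESTART
without the dispatcher confusing it with a command that nobody recognised.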


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
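
A rough user-space model of this calling convention is sketched below. It
assumes an in-memory "file"; struct model_file, model_read() and
MODEL_UPDATE_OFFSET are hypothetical names chosen for illustration and are not
the kernel's definitions.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* off_t */

/* Illustrative flag: advance *offset by the number of bytes transferred. */
#define MODEL_UPDATE_OFFSET	0x01

struct model_file {
	const char *data;	/* backing bytes */
	size_t size;		/* number of valid bytes in data */
	off_t offset;		/* the descriptor's own position */
};

/*
 * Read up to len bytes starting at *offset. The caller decides which
 * offset to use and whether it should be updated afterwards.
 */
static size_t
model_read(struct model_file *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	if (*offset < 0 || (size_t)*offset >= fp->size)
		return 0;

	size_t avail = fp->size - (size_t)*offset;
	size_t n = len < avail ? len : avail;

	memcpy(buf, fp->data + *offset, n);
	if (flags & MODEL_UPDATE_OFFSET)
		*offset += (off_t)n;
	return n;
}
```

With this shape, a read(2)-style caller passes &fp->offset together with
MODEL_UPDATE_OFFSET, while a pread(2)-style caller passes a pointer to a local
copy of the requested position and no flag, leaving the file's own offset
untouched.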


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.212 23-May-2020 ad

Move proc_lock into the data segment. It was dynamically allocated because
at the time we had mutex_obj_alloc() but not __cacheline_aligned.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1
# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.
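
The idea of hiding the counter behind an accessor that performs a relaxed
atomic load can be sketched in portable C11 as below. This is an illustrative
model only: struct model_vnode and model_vrefcnt() are hypothetical names, and
C11 <stdatomic.h> stands in for the kernel's own atomic_load_relaxed().

```c
#include <stdatomic.h>

/* Toy stand-in for a vnode; only the use count is modelled. */
struct model_vnode {
	atomic_int usecount;
};

/*
 * Return a snapshot of the use count. A relaxed load guarantees the
 * value is read in one piece but imposes no ordering on other memory
 * accesses, so callers must treat the result as advisory.
 */
static inline int
model_vrefcnt(struct model_vnode *vp)
{
	return atomic_load_explicit(&vp->usecount, memory_order_relaxed);
}
```

Callers then write model_vrefcnt(vp) instead of reading the field directly, so
the representation of the counter can change later without touching them.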


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

branches: 1.207.4;
Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
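
The truncation hazard described above can be reproduced in a few lines of
user-space C. This is a sketch under the assumption that unsigned int is 32
bits wide; model_uimin(), clamp_narrow() and clamp_wide() are hypothetical
names, not libkern's.

```c
#include <stdint.h>

/* Same shape as a min() defined on unsigned int. */
static unsigned int
model_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Clamp a 64-bit length through the unsigned int helper. The casts
 * spell out the conversion the compiler would otherwise perform
 * implicitly at the call: the upper 32 bits are discarded first.
 */
static uint64_t
clamp_narrow(uint64_t len, uint64_t limit)
{
	return model_uimin((unsigned int)len, (unsigned int)limit);
}

/* Clamp with a full-width comparison; nothing is truncated. */
static uint64_t
clamp_wide(uint64_t len, uint64_t limit)
{
	return len < limit ? len : limit;
}
```

For a length of 2^32 + 1 and a limit of 4096, the narrow version compares 1
against 4096 and yields 1, while the wide version yields 4096. Giving the
helper a name that states its argument type makes that difference visible at
every call site.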


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
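
For comparison, the check-after-open alternative mentioned above looks roughly
like the sketch below. open_regular() is a hypothetical helper name; the
#ifdef uses O_REGULAR where the platform's headers define it and otherwise
falls back to the fstat() approach, which still has the drawback noted above
that the open itself has already happened. The EINVAL in the fallback is an
arbitrary choice for this sketch.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open path only if it names a regular file; return -1 otherwise. */
static int
open_regular(const char *path, int flags)
{
#ifdef O_REGULAR
	/* The kernel refuses anything that is not a regular file. */
	return open(path, flags | O_REGULAR);
#else
	/* Fallback: open without blocking, then inspect the file type. */
	int fd = open(path, flags | O_NONBLOCK);
	if (fd == -1)
		return -1;

	struct stat st;
	if (fstat(fd, &st) == -1) {
		int saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* arbitrary error code for this sketch */
		return -1;
	}
	return fd;
#endif
}
```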


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33
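
The "check for a dead vnode once the lock is held" rule can be illustrated
with a single-threaded toy model. It is a heavy simplification and not the
kernel's code: struct model_vnode, model_vn_lock() and the value of
MODEL_LK_RETRY are hypothetical, and a boolean stands in for the real lock.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_LK_RETRY	0x1	/* hypothetical value, for illustration only */

struct model_vnode {
	bool locked;	/* stand-in for the real vnode lock */
	bool dead;	/* set once the vnode has been reclaimed */
};

/*
 * Take the lock first, then look at the dead flag. Without
 * MODEL_LK_RETRY a dead vnode is unlocked again and ENOENT is
 * returned; with it, the caller accepts a locked dead vnode.
 */
static int
model_vn_lock(struct model_vnode *vp, int flags)
{
	vp->locked = true;		/* "acquire" the lock */
	if (vp->dead && (flags & MODEL_LK_RETRY) == 0) {
		vp->locked = false;	/* drop it before failing */
		return ENOENT;
	}
	return 0;
}
```

Checking the flag only after the lock is held is the point of the exercise:
the state cannot change between the check and the return.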


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
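
The leak described in this entry comes from padding bytes: assigning each member leaves any compiler-inserted padding holding stale memory. A minimal userland sketch of the fix follows, assuming a hypothetical `struct demo_stat` (not the kernel's `struct stat`) whose 64-bit member forces padding after a 32-bit one on common ABIs. It relies on the usual compiler behaviour that member stores do not disturb padding that was already cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct stat: on common ABIs the compiler
 * pads after the 32-bit member so the 64-bit member is aligned. */
struct demo_stat {
	uint32_t ds_mode;
	uint64_t ds_dev;	/* 64-bit, like the widened dev_t */
};

/* Clear every byte first (padding included), then assign members. */
static void
demo_stat_fill(struct demo_stat *sb, uint32_t mode, uint64_t dev)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->ds_mode = mode;
	sb->ds_dev = dev;
}

/* Return nonzero if every byte outside the two members is zero. */
static int
demo_stat_padding_clean(const struct demo_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	const size_t mode_lo = offsetof(struct demo_stat, ds_mode);
	const size_t dev_lo = offsetof(struct demo_stat, ds_dev);
	size_t i;

	for (i = 0; i < sizeof(*sb); i++) {
		int in_mode = i >= mode_lo && i < mode_lo + sizeof(sb->ds_mode);
		int in_dev = i >= dev_lo && i < dev_lo + sizeof(sb->ds_dev);

		if (!in_mode && !in_dev && p[i] != 0)
			return 0;
	}
	return 1;
}
```

Without the `memset`, the padding would keep whatever the buffer held before, and copying the whole structure out to userspace would expose it.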


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.
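
From userland, the effect of O_DIRECTORY can be observed with a small helper. This is only a sketch of the POSIX-2008 behaviour, not code from this revision; `open_as_directory` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open path as a directory. Returns 0 on success, or the errno
 * value set by open(2) on failure (ENOTDIR when path names something
 * that is not a directory). */
static int
open_as_directory(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd == -1)
		return errno;
	(void)close(fd);
	return 0;
}
```

Calling it on `/` should succeed, while calling it on a device node such as `/dev/null` should fail with ENOTDIR.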


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
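
The user-visible effect of the RLIMIT_FSIZE check can be sketched from userland. The helper below is hypothetical and does not reproduce the kernel check or its v_size/va_bytes semantics. It assumes `/tmp` is writable, lowers the soft limit, ignores SIGXFSZ so the process survives, and attempts a one-byte write at the limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one byte at offset `limit` with RLIMIT_FSIZE lowered to `limit`.
 * Returns 0 if the write succeeded, otherwise the errno value observed
 * (EFBIG is expected). The previous limit and handler are restored. */
static int
write_at_fsize_limit(rlim_t limit)
{
	char path[] = "/tmp/fsize-demo-XXXXXX";
	struct rlimit saved, lowered;
	void (*saved_handler)(int);
	int fd, error = 0;

	assert(limit > 0);
	fd = mkstemp(path);
	if (fd == -1)
		return errno;
	(void)unlink(path);	/* file disappears once fd is closed */

	if (getrlimit(RLIMIT_FSIZE, &saved) == -1) {
		error = errno;
		(void)close(fd);
		return error;
	}
	lowered = saved;
	lowered.rlim_cur = limit;

	/* Without this, SIGXFSZ would terminate the process. */
	saved_handler = signal(SIGXFSZ, SIG_IGN);
	if (setrlimit(RLIMIT_FSIZE, &lowered) == -1)
		error = errno;
	else if (pwrite(fd, "x", 1, (off_t)limit) == -1)
		error = errno;

	(void)setrlimit(RLIMIT_FSIZE, &saved);
	(void)signal(SIGXFSZ, saved_handler);
	(void)close(fd);
	return error;
}
```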


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
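
The cumulative nature of the four strict levels listed in this entry can be sketched as simple threshold checks. The enum and function names below are hypothetical and are not the identifiers used in the tree. Only the numeric levels 0-3 and the policies attached to them come from the list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the values 0-3 follow the strict levels listed. */
enum demo_strict_level {
	DEMO_STRICT_LEARNING = 0,	/* warnings only */
	DEMO_STRICT_IDS = 1,
	DEMO_STRICT_IPS = 2,
	DEMO_STRICT_LOCKDOWN = 3
};

static bool
demo_level_valid(int level)
{
	return level >= DEMO_STRICT_LEARNING && level <= DEMO_STRICT_LOCKDOWN;
}

/* Level 1 and above: deny access on fingerprint mismatch. */
static bool
demo_denies_mismatch(int level)
{
	assert(demo_level_valid(level));
	return level >= DEMO_STRICT_IDS;
}

/* Level 2 and above: deny write access to monitored files. */
static bool
demo_denies_monitored_write(int level)
{
	assert(demo_level_valid(level));
	return level >= DEMO_STRICT_IPS;
}

/* Level 3 only: deny access to non-monitored files. */
static bool
demo_denies_unmonitored_access(int level)
{
	assert(demo_level_valid(level));
	return level >= DEMO_STRICT_LOCKDOWN;
}
```

Because each level implies everything below it, every policy reduces to a single comparison against the level that introduces it.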


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - not always the data will be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
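
The relationship between the two ioctls can be illustrated with a toy model. The `struct demo_sendq` below is a hypothetical simplification, not the kernel's socket buffer, and the helpers only mirror the semantics stated in this entry: one reports the bytes queued, the other the bytes still free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified send queue: a capacity and a fill count. */
struct demo_sendq {
	size_t sq_size;		/* total capacity in bytes */
	size_t sq_used;		/* bytes currently queued */
};

/* What FIONWRITE reports in this model: bytes in the send queue. */
static size_t
demo_fionwrite(const struct demo_sendq *sq)
{
	assert(sq != NULL);
	return sq->sq_used;
}

/* What FIONSPACE reports in this model: free bytes in the send queue.
 * The caller does not need to know sq_size to use the result. */
static size_t
demo_fionspace(const struct demo_sendq *sq)
{
	assert(sq != NULL);
	if (sq->sq_used >= sq->sq_size)
		return 0;
	return sq->sq_size - sq->sq_used;
}
```

As the entry notes, a nonzero free count does not guarantee that a write of that size will not block.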


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
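
The convention described above can be sketched roughly as follows. This is an illustration, not the kernel code: the SKETCH_ constants only mirror the values quoted in this commit message, and sketch_driver_ioctl(), sketch_sys_ioctl() and SKETCH_CMD_GETVAL are hypothetical names. The assumption is that a lower layer returns the pass-through value for commands it does not recognise, and the top level turns that into ENOTTY for the caller.

```c
#include <assert.h>
#include <errno.h>

/* Values as quoted in the commit message; names are illustrative only. */
#define SKETCH_ERESTART      (-3)
#define SKETCH_EPASSTHROUGH  (-4)

/* Hypothetical command number understood by the lower layer. */
#define SKETCH_CMD_GETVAL    1UL

/* Lower layer: handles what it knows, passes everything else through. */
static int
sketch_driver_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case SKETCH_CMD_GETVAL:
		*data = 42;
		return 0;
	default:
		/* Not -1 and not ENOTTY: "I did not handle this command". */
		return SKETCH_EPASSTHROUGH;
	}
}

/* Top level: an unhandled command becomes ENOTTY for the caller. */
static int
sketch_sys_ioctl(unsigned long cmd, int *data)
{
	int error = sketch_driver_ioctl(cmd, data);

	if (error == SKETCH_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through value is distinct from the restart value, an intermediate layer can tell "nobody handled this" apart from "restart the system call".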


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
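
A minimal sketch of the calling convention this describes, assuming an in-memory "file" for illustration. sketch_read() and SKETCH_FOF_UPDATE_OFFSET are hypothetical names, not the kernel's prototypes: the caller passes the offset by reference together with a flags word, and the offset is advanced only when the flag is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag: advance the offset passed by reference. */
#define SKETCH_FOF_UPDATE_OFFSET 0x01

/*
 * Copy up to len bytes starting at *offset from an in-memory "file".
 * Returns the number of bytes copied.  *offset is advanced only when
 * SKETCH_FOF_UPDATE_OFFSET is set in flags.
 */
static size_t
sketch_read(const char *file, size_t filelen, int64_t *offset,
    char *buf, size_t len, int flags)
{
	size_t pos, n;

	if (*offset < 0 || (uint64_t)*offset >= filelen)
		return 0;
	pos = (size_t)*offset;
	n = filelen - pos;
	if (n > len)
		n = len;
	memcpy(buf, file + pos, n);
	if (flags & SKETCH_FOF_UPDATE_OFFSET)
		*offset += (int64_t)n;
	return n;
}
```

With this shape, a positioned read can pass its own offset with flags set to 0, while an ordinary sequential read passes the file's offset with the update flag.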


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.211 13-Apr-2020 ad

Replace most uses of vp->v_usecount with a call to vrefcnt(vp), a function
that hides the details and does atomic_load_relaxed(). Signature matches
FreeBSD.


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
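
The truncation hazard described above can be illustrated with a small sketch. The names below are hypothetical stand-ins, not the libkern definitions, and uint32_t is used in place of unsigned int so the behaviour does not depend on the platform's int width: a function taking 32-bit arguments silently narrows 64-bit operands, while the macro form compares them at full width.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min function defined on a 32-bit unsigned type. */
static uint32_t
sketch_uimin(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Macro form: may evaluate arguments twice, but never narrows them. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* 64-bit operands are converted to 32 bits at the call, without a cast. */
static uint64_t
sketch_clamp_truncating(uint64_t len, uint64_t limit)
{
	return sketch_uimin(len, limit);
}

/* The comparison is done in the operands' own 64-bit type. */
static uint64_t
sketch_clamp_exact(uint64_t len, uint64_t limit)
{
	return SKETCH_MIN(len, limit);
}
```

For example, with len = 2^32 + 1 and limit = 2^33, the function-based version sees the narrowed values 1 and 0, whereas the macro-based version compares the full 64-bit values.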


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
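
The pattern can be sketched as below, under stated assumptions: struct sketch_stat and sketch_fill_stat() are hypothetical and far smaller than the real struct stat. The point is that when a 64-bit member follows a 32-bit one, the compiler typically inserts padding, and that padding would carry whatever was previously in memory unless the whole structure is cleared before the members are filled in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure: a 32-bit member followed by a 64-bit one. */
struct sketch_stat {
	uint32_t mode;	/* typically followed by 4 bytes of padding */
	uint64_t dev;	/* 64-bit device number */
};

/*
 * Clear the entire structure first, padding included, then fill in
 * the members.  Only after this is it reasonable to copy the whole
 * object out to a less privileged consumer.
 */
static void
sketch_fill_stat(struct sketch_stat *sb, uint32_t mode, uint64_t dev)
{
	memset(sb, 0, sizeof(*sb));
	sb->mode = mode;
	sb->dev = dev;
}
```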


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has creeped
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
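
The two-step search described above (emulation root first, then the normal root) can be sketched in userland as follows. This is a minimal illustration, not the namei() implementation: sk_lookup(), sk_demo_exists() and the sample paths are hypothetical, and the ".." and symlink rules from the commit message are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Caller-supplied predicate: does this path exist? (hypothetical) */
typedef bool (*sk_exists_fn)(const char *path);

/* Demo predicate backed by a tiny fixed table, for illustration only. */
bool
sk_demo_exists(const char *path)
{
	static const char *const known[] = {
		"/emul/linux/lib/libc.so",
		"/etc/passwd",
	};
	for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++)
		if (strcmp(known[i], path) == 0)
			return true;
	return false;
}

/*
 * Resolve `path` into `out`.  Absolute paths are tried under `emulroot`
 * first when `tryemulroot` is set; on failure the lookup is retried from
 * the normal root.  Returns 0 on success, -1 if neither lookup succeeds.
 */
int
sk_lookup(const char *emulroot, bool tryemulroot, const char *path,
    sk_exists_fn exists, char *out, size_t outlen)
{
	int n;

	if (tryemulroot && emulroot != NULL && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n > 0 && (size_t)n < outlen && exists(out))
			return 0;
	}
	n = snprintf(out, outlen, "%s", path);
	if (n > 0 && (size_t)n < outlen && exists(out))
		return 0;
	return -1;
}
```

With the demo table, "/lib/libc.so" resolves inside the emulation root, while "/etc/passwd" falls back to the real root because no copy exists under "/emul/linux".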


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
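
The four strict levels listed above can be read as a cumulative policy. The following is a minimal sketch of that reading, not the veriexec code: the enum, struct and function names are hypothetical, and the access-type enforcement and table-modification rules are omitted.

```c
#include <stdbool.h>

/* Strict levels 0-3 as described in the commit message (names assumed). */
enum sk_strict {
	SK_LEARNING = 0,	/* warnings only */
	SK_IDS = 1,		/* deny on fingerprint mismatch */
	SK_IPS = 2,		/* also protect monitored files */
	SK_LOCKDOWN = 3		/* also deny non-monitored files */
};

/* Hypothetical view of a file as the policy sees it. */
struct sk_vfile {
	bool monitored;		/* has an entry in the veriexec table */
	bool fp_matches;	/* fingerprint evaluated and matched */
};

/* May the file be opened for reading or execution? */
bool
sk_may_access(enum sk_strict level, const struct sk_vfile *f)
{
	if (level >= SK_LOCKDOWN && !f->monitored)
		return false;
	if (level >= SK_IDS && f->monitored && !f->fp_matches)
		return false;
	return true;
}

/* May the file be opened for writing? */
bool
sk_may_write(enum sk_strict level, const struct sk_vfile *f)
{
	if (level >= SK_IPS && f->monitored)
		return false;
	if (level >= SK_LOCKDOWN && !f->monitored)
		return false;
	return true;
}

/* May a new file be created? */
bool
sk_may_create(enum sk_strict level)
{
	return level < SK_LOCKDOWN;
}
```

Each higher level keeps every restriction of the level below it, which is why the checks use `>=` comparisons rather than equality.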


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
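
The relationship between the two ioctls can be sketched as follows. This is an illustrative model only: struct sk_sendq and the two helper functions are hypothetical and do not correspond to the socket buffer code.

```c
#include <stddef.h>

/* Hypothetical send queue: total capacity and bytes currently queued. */
struct sk_sendq {
	size_t capacity;
	size_t queued;
};

/* What FIONWRITE reports: the number of bytes in the send queue. */
size_t
sk_fionwrite(const struct sk_sendq *q)
{
	return q->queued;
}

/* What FIONSPACE reports: the number of free bytes in the send queue. */
size_t
sk_fionspace(const struct sk_sendq *q)
{
	return q->queued < q->capacity ? q->capacity - q->queued : 0;
}
```

A caller that only has FIONWRITE would need to know the queue size to compute the free space, which is the convenience FIONSPACE provides.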


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
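
The convention can be sketched with a two-layer dispatcher. This is a userland illustration, not kernel code: the SK_ constants reuse the numeric values given above, the command numbers and function names are hypothetical, and mapping a leftover pass-through to ENOTTY at the top level is an assumption of the sketch.

```c
#include <errno.h>

/* Values from the commit message, redefined locally for the sketch. */
#define SK_ERESTART	(-3)
#define SK_EPASSTHROUGH	(-4)

/* Hypothetical ioctl commands. */
enum sk_cmd {
	SK_CMD_GENERIC = 1,
	SK_CMD_DRIVER,
	SK_CMD_INTERRUPTED,
	SK_CMD_UNKNOWN
};

/* Upper layer: handles one command, passes everything else through. */
int
sk_generic_ioctl(int cmd)
{
	return cmd == SK_CMD_GENERIC ? 0 : SK_EPASSTHROUGH;
}

/* Lower layer: may succeed, ask for a restart, or pass through. */
int
sk_driver_ioctl(int cmd)
{
	switch (cmd) {
	case SK_CMD_DRIVER:
		return 0;
	case SK_CMD_INTERRUPTED:
		return SK_ERESTART;
	default:
		return SK_EPASSTHROUGH;
	}
}

/* Top level: try each layer, then report an unhandled command. */
int
sk_ioctl(int cmd)
{
	int error = sk_generic_ioctl(cmd);

	if (error == SK_EPASSTHROUGH)
		error = sk_driver_ioctl(cmd);
	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;		/* nobody handled the command */
	return error;
}
```

Because "unhandled" and "restart" now have distinct values, the dispatcher never mistakes a restart request for an unknown command.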


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
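
The 95% rule above can be sketched as a small validation step. This is only an illustration in user-space C, not the kernel's sysctl handler: the struct, the function name set_min() and the EINVAL return are assumptions; only the three tunable names and the 95% cap come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Cap on the sum of the three minimums, as stated in the commit message. */
#define MIN_SUM_LIMIT 95

/* Hypothetical container for the three percentages. */
struct vm_mins {
	int anonmin;	/* vm.anonmin  */
	int vnodemin;	/* vm.vnodemin */
	int vtextmin;	/* vm.vtextmin */
};

/*
 * Try to change one minimum.  The new value is accepted only if the
 * sum of all three minimums would stay at or below MIN_SUM_LIMIT.
 * Returns 0 on success and EINVAL (an assumed error code) otherwise.
 */
static int
set_min(struct vm_mins *m, int *field, int newval)
{
	int sum;

	assert(m != NULL && field != NULL);
	if (newval < 0)
		return EINVAL;
	/* Sum with the old value of this field replaced by the new one. */
	sum = m->anonmin + m->vnodemin + m->vtextmin - *field + newval;
	if (sum > MIN_SUM_LIMIT)
		return EINVAL;
	*field = newval;
	return 0;
}
```

Rejecting the update rather than clamping it keeps the three values exactly as the administrator last set them, which is one plausible design; the real handler may behave differently.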


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
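
A rough sketch of the mapping described above, in user-space C. The SK_* constants are stand-ins invented for this example, not the kernel's flag values, and the exact conditions are assumptions: the sketch treats "file integrity completion of reads" as the sync and rsync flags being set together, and lets full sync take precedence over data sync on writes.

```c
#include <assert.h>

/* Hypothetical stand-ins for the per-file open flags. */
#define SK_FSYNC	0x01	/* file integrity requested (O_SYNC-like)  */
#define SK_DSYNC	0x02	/* data integrity requested (O_DSYNC-like) */
#define SK_RSYNC	0x04	/* apply to reads as well (O_RSYNC-like)   */

/* Hypothetical stand-ins for the ioflag bits handed to the vnode op. */
#define SK_IO_SYNC	0x10
#define SK_IO_DSYNC	0x20

/* ioflag for a read: only synchronized if both sync and rsync are set. */
static int
read_ioflag(int fflag)
{
	int ioflag = 0;

	assert(fflag >= 0);
	if ((fflag & (SK_FSYNC | SK_RSYNC)) == (SK_FSYNC | SK_RSYNC))
		ioflag |= SK_IO_SYNC;
	return ioflag;
}

/* ioflag for a write: full sync wins over data-only sync. */
static int
write_ioflag(int fflag)
{
	int ioflag = 0;

	assert(fflag >= 0);
	if (fflag & SK_FSYNC)
		ioflag |= SK_IO_SYNC;
	else if (fflag & SK_DSYNC)
		ioflag |= SK_IO_DSYNC;
	return ioflag;
}
```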


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
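
The idea of passing the offset by reference, plus a flag that says whether to update it, can be illustrated with a toy in-memory file. This is a sketch only: struct file_sketch, file_read_sketch() and SKETCH_UPDATE_OFFSET are hypothetical names, not the fileops interface itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: advance the offset passed by reference. */
#define SKETCH_UPDATE_OFFSET	0x01

/* Toy in-memory "file" with its own current offset. */
struct file_sketch {
	const char *data;
	size_t size;
	off_t offset;
};

/*
 * Read up to len bytes starting at *offp.  The caller chooses which
 * offset to use (the file's own, or a private one for pread-style
 * calls) and whether it should be advanced afterwards.
 */
static size_t
file_read_sketch(struct file_sketch *fp, off_t *offp, void *buf,
    size_t len, int flags)
{
	size_t avail, n;

	assert(fp != NULL && offp != NULL && buf != NULL);
	if (*offp < 0 || (size_t)*offp >= fp->size)
		return 0;
	avail = fp->size - (size_t)*offp;
	n = len < avail ? len : avail;
	memcpy(buf, fp->data + *offp, n);
	if (flags & SKETCH_UPDATE_OFFSET)
		*offp += (off_t)n;
	return n;
}
```

A read(2)-style caller would pass &fp->offset with the flag set; a pread(2)-style caller would pass a local copy of the requested position with the flag clear, leaving the file's own offset alone.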


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.210 12-Apr-2020 christos

Oops missed one more NULL -> NOCRED


# 1.209 12-Apr-2020 christos

delete debugging printf.


# 1.208 12-Apr-2020 christos

Pass NOCRED instead of NULL for credentials. These routines are supposed
to be accessing system ACL's on behalf of the kernel. This code appears
to be copied from FreeBSD, but there it works because in FreeBSD NOCRED
is 0, ours is -1. I guess nobody has used system extended attributes on
NetBSD yet :-)


Revision tags: phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.207 27-Feb-2020 ad

Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
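
The truncation hazard described above can be reproduced in a few lines of user-space C. The helper names below (uimin_sketch, MIN_SKETCH, clamp_truncating, clamp_wide) are made up for this illustration, and the sketch assumes unsigned int is 32 bits wide, as on the common ports.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) < sizeof(uint64_t),
    "sketch assumes unsigned int is narrower than 64 bits");

/* Same shape as a min() defined on unsigned int only. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: keeps the operand width, but may evaluate arguments twice. */
#define MIN_SKETCH(a, b)	((a) < (b) ? (a) : (b))

/* Clamp a 64-bit length through the unsigned int helper (lossy). */
static uint64_t
clamp_truncating(uint64_t len, uint64_t limit)
{
	/* The casts make visible what an implicit conversion would hide. */
	return uimin_sketch((unsigned int)len, (unsigned int)limit);
}

/* Clamp a 64-bit length with the width-preserving macro. */
static uint64_t
clamp_wide(uint64_t len, uint64_t limit)
{
	return MIN_SKETCH(len, limit);
}
```

With a length of 0x100000001 and a limit of 4096, clamp_truncating() sees the length as 1 and returns 1, while clamp_wide() returns 4096. Without the explicit casts the compiler would perform the same narrowing silently, which is the behaviour the rename is meant to expose.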


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
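
The leak described above comes from padding bytes that member-by-member assignment never touches. A minimal sketch, assuming a made-up two-field structure (struct stat_sketch and fill_stat() are not the kernel's definitions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * A 32-bit field followed by a 64-bit field: on typical ABIs the
 * compiler inserts 4 bytes of padding after 'mode' so that 'dev'
 * is 8-byte aligned.
 */
struct stat_sketch {
	uint32_t mode;
	uint64_t dev;
};

/*
 * Fill the structure for copying out to an untrusted consumer.
 * Zeroing the whole object first means the padding does not carry
 * whatever happened to be in that memory before.
 */
static void
fill_stat(struct stat_sketch *sb, uint32_t mode, uint64_t dev)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->mode = mode;
	sb->dev = dev;
}
```

Strictly speaking, the C standard lets padding take unspecified values after a member store, so this relies on how compilers behave in practice; the pattern is nonetheless the usual one for structures that are copied out wholesale.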


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
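
The intended behaviour can be sketched with a toy file model in user-space C. The struct and function names are hypothetical and the model ignores locking and errors; it only shows the early return that keeps a zero-length append from moving the offset to end of file.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy model of an open file: size, current offset, append mode. */
struct appendfile_sketch {
	off_t size;
	off_t offset;
	int append;
};

/*
 * Write len bytes.  In append mode the offset jumps to end of file
 * first, but only if there is actually something to write.
 */
static size_t
write_sketch(struct appendfile_sketch *fp, size_t len)
{
	assert(fp != NULL);
	if (len == 0)
		return 0;		/* nothing written: offset untouched */
	if (fp->append)
		fp->offset = fp->size;	/* appends always start at EOF */
	fp->offset += (off_t)len;
	if (fp->offset > fp->size)
		fp->size = fp->offset;
	return len;
}
```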


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
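The lookup order described in this entry can be sketched in user-space C.
This is only an illustration under stated assumptions: SK_TRYEMULROOT,
sk_lookup(), sk_namei() and the toy file table are hypothetical names, not
the kernel's namei() interface, and the handling of ".." and symlinks is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the TRYEMULROOT flag described above. */
#define SK_TRYEMULROOT	0x01

/* Toy "file system": the only paths that exist in this sketch. */
static const char *const sk_files[] = {
	"/emul/linux/lib/libc.so",
	"/etc/passwd",
};

/* Look up path below root; returns 0 if it exists, ENOENT otherwise. */
static int
sk_lookup(const char *root, const char *path)
{
	char full[256];
	size_t i;

	snprintf(full, sizeof(full), "%s%s", root, path);
	for (i = 0; i < sizeof(sk_files) / sizeof(sk_files[0]); i++) {
		if (strcmp(full, sk_files[i]) == 0)
			return 0;
	}
	return ENOENT;
}

/*
 * Try the emulation root first for absolute paths when the flag is set,
 * then fall back to the real root (the empty prefix in this sketch).
 * *used_root reports which root the final lookup was made against.
 */
static int
sk_namei(const char *emul_root, const char *path, int flags,
    const char **used_root)
{
	static const char real_root[] = "";

	assert(path != NULL && used_root != NULL);
	if (emul_root != NULL && (flags & SK_TRYEMULROOT) && path[0] == '/') {
		if (sk_lookup(emul_root, path) == 0) {
			*used_root = emul_root;
			return 0;
		}
	}
	*used_root = real_root;
	return sk_lookup(real_root, path);
}
```

With emul_root set to "/emul/linux", a lookup of "/lib/libc.so" is satisfied
from the emulation root, while "/etc/passwd" falls through to the real root.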


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
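The four strict levels listed in this entry are cumulative, so a policy check
reduces to comparing the current level with the lowest level that denies a
given event. A minimal sketch, assuming hypothetical enum and function names
(the log only gives the numeric levels 0-3) and leaving out the access-type
enforcement of level 2:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for strict levels 0-3. */
enum sk_strict {
	SK_LEARNING = 0,	/* warnings only */
	SK_IDS = 1,
	SK_IPS = 2,
	SK_LOCKDOWN = 3
};

/* Hypothetical event codes, one per policy item in the list above. */
enum sk_event {
	SK_EV_FP_MISMATCH,
	SK_EV_TABLE_MODIFY,
	SK_EV_WRITE_MONITORED,
	SK_EV_REMOVE_MONITORED,
	SK_EV_CREATE_FILE,
	SK_EV_ACCESS_UNMONITORED
};

/* Lowest strict level at which the event is denied rather than reported. */
static enum sk_strict
sk_deny_level(enum sk_event ev)
{
	switch (ev) {
	case SK_EV_FP_MISMATCH:
	case SK_EV_TABLE_MODIFY:
		return SK_IDS;
	case SK_EV_WRITE_MONITORED:
	case SK_EV_REMOVE_MONITORED:
		return SK_IPS;
	case SK_EV_CREATE_FILE:
	case SK_EV_ACCESS_UNMONITORED:
		return SK_LOCKDOWN;
	}
	assert(!"unknown event");
	return SK_LOCKDOWN;
}

/* Each level implies all lower ones, so a single comparison suffices. */
static bool
sk_denied(enum sk_strict level, enum sk_event ev)
{
	return level >= sk_deny_level(ev);
}
```

At level 0 nothing is denied; at level 3 every listed event is.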


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
Currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals.

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
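The convention in this entry can be sketched as a chain of handlers, each of
which returns a distinct "not mine" code for commands it does not recognise.
The numeric values come from the entry above; the SK_ names, the two handlers,
and the final mapping of an unclaimed command to ENOTTY are assumptions made
for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values as stated in the log entry; the SK_ names are hypothetical. */
#define SK_ERESTART	(-3)
#define SK_EPASSTHROUGH	(-4)

#define SK_CMD_GETCOLS	1UL	/* handled by the first layer */
#define SK_CMD_RESET	2UL	/* handled by the second layer */

typedef int (*sk_ioctl_fn)(unsigned long, void *);

static int
sk_tty_ioctl(unsigned long cmd, void *data)
{
	if (cmd == SK_CMD_GETCOLS) {
		*(int *)data = 80;
		return 0;
	}
	return SK_EPASSTHROUGH;	/* not ours: let the next layer try */
}

static int
sk_dev_ioctl(unsigned long cmd, void *data)
{
	(void)data;
	return cmd == SK_CMD_RESET ? 0 : SK_EPASSTHROUGH;
}

static int
sk_ioctl(unsigned long cmd, void *data)
{
	static const sk_ioctl_fn layers[] = { sk_tty_ioctl, sk_dev_ioctl };
	size_t i;
	int error = SK_EPASSTHROUGH;

	for (i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
		error = layers[i](cmd, data);
		if (error != SK_EPASSTHROUGH)
			break;
	}
	/* Assumption: an unclaimed command is reported as ENOTTY. */
	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	assert(error != -1);	/* -1 is never a valid ioctl result here */
	return error;
}
```

Because "unhandled" no longer shares a value with a restart request, the
caller can tell the two cases apart.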


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
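The flag translation described in this entry might look roughly like the
sketch below. All SK_ constants and both function names are hypothetical, and
the exact conditions are an assumption: reads get IO_SYNC only when both file
integrity and read synchronization were requested, and writes prefer IO_SYNC
over IO_DSYNC when both apply.

```c
#include <assert.h>

/* Hypothetical per-file flags, as they might be derived from open(2). */
#define SK_FFSYNC	0x01	/* file integrity completion requested */
#define SK_FDSYNC	0x02	/* data integrity completion requested */
#define SK_FRSYNC	0x04	/* apply the above to reads as well */
#define SK_FMASK	(SK_FFSYNC | SK_FDSYNC | SK_FRSYNC)

/* Hypothetical stand-ins for the IO_SYNC and IO_DSYNC ioflag bits. */
#define SK_IO_SYNC	0x10
#define SK_IO_DSYNC	0x20

/* ioflag to hand to the read vnode operation. */
static int
sk_read_ioflag(int fflag)
{
	int ioflag = 0;

	assert((fflag & ~SK_FMASK) == 0);
	if ((fflag & (SK_FFSYNC | SK_FRSYNC)) == (SK_FFSYNC | SK_FRSYNC))
		ioflag |= SK_IO_SYNC;
	return ioflag;
}

/* ioflag to hand to the write vnode operation. */
static int
sk_write_ioflag(int fflag)
{
	int ioflag = 0;

	assert((fflag & ~SK_FMASK) == 0);
	if (fflag & SK_FFSYNC)
		ioflag |= SK_IO_SYNC;	/* file integrity implies data integrity */
	else if (fflag & SK_FDSYNC)
		ioflag |= SK_IO_DSYNC;
	return ioflag;
}
```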


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
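A read routine with the two extra arguments from this entry could be shaped
like the sketch below, here over an in-memory buffer. The flag name
SK_FOF_UPDATE_OFFSET, struct sk_file and sk_read() are hypothetical; the log
only says that an offset pointer and a flags word were added.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the "update the offset" flag. */
#define SK_FOF_UPDATE_OFFSET	0x01

struct sk_file {
	const char *data;	/* backing bytes */
	size_t size;		/* number of bytes in data */
	size_t offset;		/* the file's own current offset */
};

/*
 * Read up to len bytes starting at *offset. The offset is advanced only
 * when the caller asks for it, so the same routine can serve both
 * sequential reads and positioned reads.
 */
static size_t
sk_read(struct sk_file *fp, size_t *offset, void *buf, size_t len, int flags)
{
	size_t n;

	assert(fp != NULL && offset != NULL && buf != NULL);
	if (*offset >= fp->size)
		return 0;
	n = fp->size - *offset;
	if (n > len)
		n = len;
	memcpy(buf, fp->data + *offset, n);
	if (flags & SK_FOF_UPDATE_OFFSET)
		*offset += n;
	return n;
}
```

A sequential read would pass &fp->offset together with the flag, while a
positioned read would pass a caller-owned offset and no flag, leaving the
file's own offset untouched.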


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.207 27-Feb-2020 ad

Tighten up the locking around vp->v_iflag a little more after the recent
split of vmobjlock & v_interlock.


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
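
The truncation hazard described in this entry can be reproduced in a few lines of userland C. This is a minimal sketch, not the libkern code: `sketch_uimin`, `SKETCH_MIN` and `uimin_truncates` are hypothetical names, and it assumes `unsigned int` is narrower than 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() defined on unsigned int only (hypothetical name). */
static unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro: no truncation, but arguments are evaluated twice. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Returns nonzero when routing a 64-bit length through the unsigned int
 * function gives a different answer than the type-preserving macro.
 */
static int
uimin_truncates(uint64_t len, unsigned int limit)
{
	/* The demonstration only makes sense if unsigned int is narrower. */
	assert(sizeof(unsigned int) < sizeof(uint64_t));

	/* The cast spells out the conversion a plain call does silently. */
	uint64_t via_func = sketch_uimin((unsigned int)len, limit);
	uint64_t via_macro = SKETCH_MIN(len, (uint64_t)limit);

	return via_func != via_macro;
}
```

With a length of 2^32 + 1 and a limit of 4096, the function path sees only the low 32 bits (1) and returns 1, while the macro returns 4096. Small lengths give the same answer on both paths.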


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
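
The ordering named in this entry can be sketched as follows. This is a single-threaded toy that uses boolean flags in place of real vnode locks; `struct sketch_node` and the `sketch_*` helpers are hypothetical and only illustrate the order of operations, not the code the commit changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy node: a flag stands in for the lock (not a real lock). */
struct sketch_node {
	bool locked;
	struct sketch_node *child;
};

static void
sketch_lock(struct sketch_node *n)
{
	assert(!n->locked);	/* toy check: no recursive locking */
	n->locked = true;
}

static void
sketch_unlock(struct sketch_node *n)
{
	assert(n->locked);
	n->locked = false;
}

/*
 * Step from parent to child: lock the parent, lock the child while the
 * parent is still held, then unlock the parent. Returns the child
 * (locked), or NULL if there is none. The parent is unlocked on return.
 */
static struct sketch_node *
sketch_step_down(struct sketch_node *parent)
{
	struct sketch_node *child;

	sketch_lock(parent);
	child = parent->child;
	if (child != NULL)
		sketch_lock(child);
	sketch_unlock(parent);
	return child;
}
```

Holding the parent until the child is locked means the child reference cannot change between the lookup and the lock.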


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
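
The fix pattern in this entry, zeroing the whole structure before filling in members, can be sketched as below. `struct sketch_stat` and `sketch_fill_stat` are hypothetical stand-ins, not the kernel's `struct stat` or `vn_stat()`. The layout is chosen only so that most ABIs insert padding between members.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; most ABIs pad after st_kind and after st_mode. */
struct sketch_stat {
	uint8_t  st_kind;
	uint64_t st_dev;	/* 64-bit member forces alignment padding */
	uint16_t st_mode;
};

/*
 * Fill in a record that will later be copied out to a less trusted
 * consumer. Zeroing first ensures padding bytes do not carry whatever
 * happened to be in memory before.
 */
static void
sketch_fill_stat(struct sketch_stat *sb, uint64_t dev, uint16_t mode)
{
	assert(sb != NULL);

	memset(sb, 0, sizeof(*sb));	/* clears members and padding alike */
	sb->st_kind = 1;
	sb->st_dev = dev;
	sb->st_mode = mode;
}
```

Assigning each member individually without the `memset` would leave the padding bytes untouched, which is how stale memory can leak when the structure is copied as a block.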


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
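
The search order described above (emulation root first for absolute paths, then the normal root) can be modelled roughly as follows. This sketch works on path strings against a toy table of existing paths. The kernel works with the emulation root vnode rather than a string prefix, so `sketch_lookup`, `sketch_exists` and the sample paths are illustrative assumptions only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy "file system": the only paths that exist in this sketch. */
static const char *const sketch_paths[] = {
	"/emul/linux/lib/libc.so",
	"/etc/passwd",
};

static bool
sketch_exists(const char *path)
{
	for (size_t i = 0; i < sizeof(sketch_paths) / sizeof(sketch_paths[0]); i++)
		if (strcmp(sketch_paths[i], path) == 0)
			return true;
	return false;
}

/*
 * Resolve 'path' into 'out'. If an emulation root is set, the caller
 * asked for it (try_emul_root), and the path is absolute, look under
 * the emulation root first; on failure retry from the normal root.
 * Returns 0 on success, -1 if nothing matched or 'out' is too small.
 */
static int
sketch_lookup(const char *emul_root, bool try_emul_root,
    const char *path, char *out, size_t outlen)
{
	assert(path != NULL && out != NULL);

	if (emul_root != NULL && try_emul_root && path[0] == '/') {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n >= 0 && (size_t)n < outlen && sketch_exists(out))
			return 0;
	}
	if (strlen(path) >= outlen)
		return -1;
	strcpy(out, path);
	return sketch_exists(out) ? 0 : -1;
}
```

Here `/lib/libc.so` resolves inside the emulation root, while `/etc/passwd` is not found there and falls back to the normal root. Without the flag, only the normal root is searched.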


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
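
The four strict levels listed above read as a cumulative policy: each level keeps everything the lower levels enforce and adds more. A minimal sketch of that reading follows. The enum and function names here are hypothetical illustrations and are not taken from the veriexec sources.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels in the commit message. */
enum strict_level {
	STRICT_LEARNING = 0,	/* warnings only */
	STRICT_IDS = 1,		/* deny access on fingerprint mismatch */
	STRICT_IPS = 2,		/* also protect monitored files */
	STRICT_LOCKDOWN = 3	/* also deny creation and non-monitored access */
};

/* Level 1 and above: a fingerprint mismatch denies access. */
static bool
deny_on_mismatch(enum strict_level level)
{
	return level >= STRICT_IDS;
}

/* Level 2 and above: monitored files may not be written or removed. */
static bool
deny_modify_monitored(enum strict_level level)
{
	return level >= STRICT_IPS;
}

/* Level 3: no new files, and no access to non-monitored files. */
static bool
deny_create_or_unmonitored(enum strict_level level)
{
	return level >= STRICT_LOCKDOWN;
}
```

Expressing each check as a comparison against a threshold is one way to capture "all implications of strict level N" without repeating the lower-level rules.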


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
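
The relationship between the two ioctls described above can be sketched with a toy send queue. This is only an illustration under stated assumptions: the struct, its fields, and the function names are hypothetical, and clamping the free space at zero is an assumption rather than something the commit message states.

```c
#include <stddef.h>

/* Hypothetical send queue: 'used' bytes queued out of 'limit' bytes. */
struct demo_sendq {
	size_t used;
	size_t limit;
};

/* What FIONWRITE is described as reporting: bytes in the send queue. */
static int
demo_fionwrite(const struct demo_sendq *q)
{
	return (int)q->used;
}

/* What FIONSPACE is described as reporting: free bytes in the send queue. */
static int
demo_fionspace(const struct demo_sendq *q)
{
	if (q->used >= q->limit)
		return 0;	/* assumption: never report negative space */
	return (int)(q->limit - q->used);
}
```

With both values available, a caller does not need to know the queue size to see how much room is left, which is the convenience the commit message gives for FIONSPACE.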


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
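
The convention described above can be sketched as a two-layer dispatch. The numeric values -3 and -4 come from the commit message; the DEMO_ names, the command number, and both functions are hypothetical, and the conversion to ENOTTY at the top level is shown as one plausible way to finish the hand-off.

```c
#include <errno.h>

/* Values from the commit message; the DEMO_ names are hypothetical. */
#define DEMO_ERESTART		(-3)
#define DEMO_EPASSTHROUGH	(-4)

#define DEMO_CMD_KNOWN		1	/* hypothetical handled command */

/* Lower layer: handle what it knows, otherwise pass the command through. */
static int
demo_lower_ioctl(int cmd, int *result)
{
	switch (cmd) {
	case DEMO_CMD_KNOWN:
		*result = 42;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;
	}
}

/* Top level: if no layer handled the command, report ENOTTY. */
static int
demo_sys_ioctl(int cmd, int *result)
{
	int error = demo_lower_ioctl(cmd, result);

	if (error == DEMO_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because "unhandled" has its own value, it can no longer be confused with a restart request, which is the ambiguity the commit message attributes to returning -1.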


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-setable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
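
The 95% rule above amounts to a validation step when one of the minimums is changed. The following is a minimal sketch of such a check, not the kernel's sysctl handler: the struct, enum, and function are hypothetical, and only the field names echo the tunables named in the commit message.

```c
#include <errno.h>

/* Hypothetical container for the three minimum percentages. */
struct demo_mins {
	int anonmin;
	int vnodemin;
	int vtextmin;
};

enum demo_which { DEMO_ANON, DEMO_VNODE, DEMO_VTEXT };

/* Set one minimum; reject values that push the total above 95 percent. */
static int
demo_set_min(struct demo_mins *m, enum demo_which which, int pct)
{
	struct demo_mins n = *m;

	if (pct < 0)
		return EINVAL;
	switch (which) {
	case DEMO_ANON:
		n.anonmin = pct;
		break;
	case DEMO_VNODE:
		n.vnodemin = pct;
		break;
	case DEMO_VTEXT:
		n.vtextmin = pct;
		break;
	}
	if (n.anonmin + n.vnodemin + n.vtextmin > 95)
		return EINVAL;
	*m = n;		/* commit only after the sum has been checked */
	return 0;
}
```

Checking the sum on a copy means a rejected update leaves the previous thresholds untouched.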


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
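
The calling convention described above, an offset passed by reference plus a flag that controls whether it is advanced, can be sketched with an in-memory file. Everything named here (the flag, the struct, the function) is hypothetical and only illustrates the idea; it is not the fileops interface itself.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_UPDATE_OFFSET	0x01	/* hypothetical flag name */

/* Hypothetical in-memory "file". */
struct demo_file {
	const char *data;
	size_t size;
};

/*
 * Read up to len bytes starting at *offset.  The offset is passed by
 * reference and is only advanced when DEMO_UPDATE_OFFSET is set.
 */
static size_t
demo_read(const struct demo_file *fp, size_t *offset, void *buf, size_t len,
    int flags)
{
	size_t n;

	if (*offset >= fp->size)
		return 0;
	n = fp->size - *offset;
	if (n > len)
		n = len;
	memcpy(buf, fp->data + *offset, n);
	if (flags & DEMO_UPDATE_OFFSET)
		*offset += n;
	return n;
}
```

One routine can then serve both a sequential read, which moves the offset, and a positioned read, which leaves it alone.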


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.206 23-Feb-2020 ad

UVM locking changes, proposed on tech-kern:

- Change the lock on uvm_object, vm_amap and vm_anon to be a RW lock.
- Break v_interlock and vmobjlock apart. v_interlock remains a mutex.
- Do partial PV list locking in the x86 pmap. Others to follow later.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

branches: 1.204.2;
- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
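
The truncation hazard behind this rename can be shown in a few lines. This is a sketch with hypothetical names (demo_uimin, DEMO_MIN, demo_truncating_min), and it assumes a platform where unsigned int is 32 bits wide and unsigned long long is wider.

```c
/* Function-style minimum defined on unsigned int, as described above. */
static unsigned int
demo_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro-style minimum: no truncation, but arguments may be evaluated twice. */
#define DEMO_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Pass values that may not fit in unsigned int.  The explicit casts spell
 * out the conversion that would otherwise happen silently at the call site.
 */
static unsigned long long
demo_truncating_min(unsigned long long a, unsigned long long b)
{
	return demo_uimin((unsigned int)a, (unsigned int)b);
}
```

Under those assumptions, calling demo_truncating_min with 0x100000001 and 5 compares 1 against 5, because the high bits are dropped before the comparison, whereas DEMO_MIN on the same operands compares the full values.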


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
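
The idiom can be sketched in userland as follows. struct sketch_stat is
a hypothetical record, not the real struct stat; its layout is chosen so
that most ABIs insert padding around the 64-bit member. The check relies
on compilers leaving padding bytes untouched when individual members are
stored, which is the practical behaviour this kind of fix depends on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; padding is likely after 'mode' and after 'nlink'. */
struct sketch_stat {
	uint32_t mode;
	uint64_t dev;
	uint16_t nlink;
};

/* True when byte index i falls inside member m of struct sketch_stat. */
#define IN_MEMBER(i, m) \
	((i) >= offsetof(struct sketch_stat, m) && \
	 (i) < offsetof(struct sketch_stat, m) + \
	     sizeof(((struct sketch_stat *)0)->m))

/* Zero the whole record first, then fill in the named members. */
static void
fill_sketch_stat(struct sketch_stat *sb, uint32_t mode, uint64_t dev,
    uint16_t nlink)
{
	memset(sb, 0, sizeof(*sb));
	sb->mode = mode;
	sb->dev = dev;
	sb->nlink = nlink;
}

/* Count bytes outside the named members that are not zero. */
static size_t
nonzero_padding_bytes(const struct sketch_stat *sb)
{
	unsigned char raw[sizeof(*sb)];
	size_t i, n = 0;

	memcpy(raw, sb, sizeof(raw));
	for (i = 0; i < sizeof(raw); i++) {
		if (IN_MEMBER(i, mode) || IN_MEMBER(i, dev) ||
		    IN_MEMBER(i, nlink))
			continue;
		if (raw[i] != 0)
			n++;
	}
	return n;
}
```

Without the memset(), whatever was previously in the buffer would remain
in the padding and be copied out along with the members.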


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
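
The two-step search for absolute pathnames can be approximated in
userland as below. This is only a sketch: resolve_with_emulroot is a
hypothetical helper that builds a candidate path and probes it with
access(2), whereas the real lookup happens inside namei(). It does not
model the ".." handling, the symlink rules, or the substitution of the
real root for the emulation root.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Choose the path to use for a lookup: for absolute pathnames, try the
 * name under emulroot first; if that does not exist (or emulroot is
 * NULL, or the name is relative), use the name unchanged.
 * Returns 0 with the result in out, or -1 if out is too small.
 */
static int
resolve_with_emulroot(const char *emulroot, const char *path, char *out,
    size_t outlen)
{
	int n;

	if (emulroot != NULL && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n >= 0 && (size_t)n < outlen && access(out, F_OK) == 0)
			return 0;	/* found inside the emulation root */
	}
	/* Retry from the normal root (or keep a relative name as is). */
	n = snprintf(out, outlen, "%s", path);
	return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```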


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
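
Because each level includes all implications of the one below it, the
policy listed above can be sketched as threshold checks on a single
integer. The enum and function names here are hypothetical and chosen
for illustration; they are not the identifiers used in the kernel.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels (0-3) described above. */
enum sketch_strict {
	SKETCH_LEARNING = 0,	/* warnings only */
	SKETCH_IDS = 1,
	SKETCH_IPS = 2,
	SKETCH_LOCKDOWN = 3
};

/* Level 1 and up: deny access on fingerprint mismatch. */
static bool
denies_fp_mismatch(enum sketch_strict level)
{
	return level >= SKETCH_IDS;
}

/* Level 1 and up: deny modification of the veriexec tables. */
static bool
denies_table_modification(enum sketch_strict level)
{
	return level >= SKETCH_IDS;
}

/* Level 2 and up: deny write access to monitored files. */
static bool
denies_write_to_monitored(enum sketch_strict level)
{
	return level >= SKETCH_IPS;
}

/* Level 2 and up: prevent removal of monitored files. */
static bool
denies_removal_of_monitored(enum sketch_strict level)
{
	return level >= SKETCH_IPS;
}

/* Level 3 only: prevent creation of new files. */
static bool
denies_file_creation(enum sketch_strict level)
{
	return level >= SKETCH_LOCKDOWN;
}

/* Level 3 only: deny access to files that are not monitored. */
static bool
denies_unmonitored_access(enum sketch_strict level)
{
	return level >= SKETCH_LOCKDOWN;
}
```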


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
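
A hedged userland sketch of querying the send queue with these ioctls.
The wrapper names send_queue_used and send_queue_free are hypothetical,
and the sketch assumes the ioctls store their result in an int. On
systems whose headers do not define FIONWRITE or FIONSPACE the wrappers
simply fail with ENOTSUP.

```c
#include <sys/ioctl.h>
#include <errno.h>

/*
 * Store the number of bytes currently in the send queue of fd.
 * Returns the ioctl result, or -1 with errno set.
 */
static int
send_queue_used(int fd, int *nbytes)
{
#ifdef FIONWRITE
	return ioctl(fd, FIONWRITE, nbytes);
#else
	(void)fd;
	(void)nbytes;
	errno = ENOTSUP;	/* ioctl not available on this platform */
	return -1;
#endif
}

/*
 * Store the number of free bytes in the send queue of fd.
 * Returns the ioctl result, or -1 with errno set.
 */
static int
send_queue_free(int fd, int *nbytes)
{
#ifdef FIONSPACE
	return ioctl(fd, FIONSPACE, nbytes);
#else
	(void)fd;
	(void)nbytes;
	errno = ENOTSUP;	/* ioctl not available on this platform */
	return -1;
#endif
}
```

As the message notes, a write may still block even when the reported
free space is large enough, so the result is a hint rather than a
guarantee.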


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
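
The layering described above can be sketched in userland C. This is only
an illustration: the value -4 comes from the commit message, while the
function names, the command number and the translation of an unhandled
command to ENOTTY at the top level are assumptions of this sketch.

```c
#include <errno.h>

/* Stand-in for the kernel-private value named in the commit message. */
#define EPASSTHROUGH_SKETCH (-4)

/* Hypothetical command number understood by the lower layer. */
#define CMD_GET_ANSWER 1

/* Lower layer: handle known commands, pass everything else through. */
static int
driver_ioctl(unsigned long cmd, int *data)
{
    switch (cmd) {
    case CMD_GET_ANSWER:
        *data = 42;
        return 0;
    default:
        /* Not ours: say so explicitly instead of returning -1. */
        return EPASSTHROUGH_SKETCH;
    }
}

/* Top level: a command nobody handled becomes ENOTTY for the caller. */
int
toplevel_ioctl(unsigned long cmd, int *data)
{
    int error = driver_ioctl(cmd, data);

    if (error == EPASSTHROUGH_SKETCH)
        error = ENOTTY;
    return error;
}
```

Because the "unhandled" value is distinct from -1, it can no longer be
mistaken for a restart request on the way back up.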


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to, e.g., an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
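
The 95% rule can be illustrated with a small validation routine. This is
a sketch only: the enum, the array and the function name are hypothetical
and do not mirror the kernel's sysctl handler.

```c
#include <errno.h>

/* Hypothetical usage classes matching the three tunables. */
enum usage_sketch { USAGE_ANON, USAGE_FILE, USAGE_EXEC, USAGE_COUNT };

/* Current minimum percentage reserved for each usage class. */
static int min_pct[USAGE_COUNT];

/*
 * Set one minimum, rejecting the change if the three minimums would
 * then add up to more than 95% of pageable memory.
 */
int
set_min_pct(enum usage_sketch which, int pct)
{
    int i, sum = 0;

    if (pct < 0 || (int)which < 0 || which >= USAGE_COUNT)
        return EINVAL;
    for (i = 0; i < USAGE_COUNT; i++)
        sum += (i == (int)which) ? pct : min_pct[i];
    if (sum > 95)
        return EINVAL;
    min_pct[which] = pct;
    return 0;
}
```

With minimums of 50 and 40 already set, a third minimum of 10 would be
refused while 5 would be accepted.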


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
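
A userland model of the calling convention described above, assuming a
memory-backed "file". The structure, the function and the flag name
(suffixed _SKETCH) are hypothetical stand-ins, not the kernel's fileops
definitions.

```c
#include <sys/types.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: update the offset passed by reference. */
#define FOF_UPDATE_OFFSET_SKETCH 0x01

/* Hypothetical in-memory file with its own current offset. */
struct memfile_sketch {
    const char *data;
    size_t size;
    off_t offset;
};

/*
 * Read up to len bytes starting at *offset. The offset is advanced
 * only when the caller asks for it through flags.
 */
ssize_t
memfile_read(struct memfile_sketch *fp, off_t *offset, void *buf,
    size_t len, int flags)
{
    size_t avail;

    if (*offset < 0 || (size_t)*offset >= fp->size)
        return 0;
    avail = fp->size - (size_t)*offset;
    if (len > avail)
        len = avail;
    memcpy(buf, fp->data + *offset, len);
    if (flags & FOF_UPDATE_OFFSET_SKETCH)
        *offset += (off_t)len;
    return (ssize_t)len;
}
```

A read(2)-style caller would pass &fp->offset together with the flag,
while a pread(2)-style caller would pass the address of a local offset
and no flag, leaving the file's own offset untouched.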


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.205 12-Jan-2020 ad

- Shuffle some items around in struct lwp to save space. Remove an unused
item or two.

- For lockstat, get a useful callsite for vnode locks (caller to vn_lock()).


Revision tags: ad-namecache-base
# 1.204 16-Dec-2019 ad

- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
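
The truncation hazard can be reproduced in a few lines. This sketch
assumes a 32-bit unsigned int; the names carry a _sketch/_SKETCH suffix
to make clear they are local illustrations and not the libkern
definitions.

```c
#include <stdint.h>

/* Function form: both arguments are converted to unsigned int. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Macro form: compares in the operands' own type, no truncation. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/* len is silently narrowed to 32 bits on the way into uimin_sketch(). */
uint64_t
clamp_truncating(uint64_t len, uint64_t cap)
{
    return uimin_sketch(len, cap);
}

/* The comparison happens in 64 bits, so large values survive. */
uint64_t
clamp_exact(uint64_t len, uint64_t cap)
{
    return MIN_SKETCH(len, cap);
}
```

For len = 0x100000001 and cap = 16, the first helper sees len as 1 and
returns 1, while the second correctly returns 16.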


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
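
A hedged sketch of both approaches from a caller's point of view. Where
<fcntl.h> defines O_REGULAR it is used directly; otherwise the code falls
back to the open(O_NONBLOCK) plus fstat() check that the commit message
describes as the inferior alternative. The function name and the EINVAL
chosen in the fallback path are assumptions of this example.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path read-only, but only if it names a regular file.
 * Returns a file descriptor, or -1 with errno set.
 */
int
open_regular_only(const char *path)
{
#ifdef O_REGULAR
    /* The kernel refuses anything that is not a regular file. */
    return open(path, O_RDONLY | O_REGULAR);
#else
    struct stat st;
    int fd, saved;

    /* Fallback: the open itself may already have had side effects. */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (!S_ISREG(st.st_mode)) {
        close(fd);
        errno = EINVAL; /* arbitrary choice for this sketch */
        return -1;
    }
    return fd;
#endif
}
```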


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
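
The ordering rule can be modelled with POSIX mutexes. This is a generic
illustration of the parent-then-child pattern using a hypothetical node
type; it is not the vnode code touched by this revision.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical tree node protected by its own lock. */
struct node_sketch {
    pthread_mutex_t lock;
    struct node_sketch *child;
};

/*
 * Step from parent to child. The parent must be unlocked on entry.
 * On return the child (if any) is locked and the parent is unlocked.
 */
struct node_sketch *
descend_locked(struct node_sketch *parent)
{
    struct node_sketch *child;

    pthread_mutex_lock(&parent->lock);      /* 1. lock the parent */
    child = parent->child;
    if (child != NULL)
        pthread_mutex_lock(&child->lock);   /* 2. lock the child */
    pthread_mutex_unlock(&parent->lock);    /* 3. unlock the parent */
    return child;
}
```

Holding the parent while taking the child's lock keeps the link between
the two stable, and always taking the locks in this order avoids lock
order reversals between threads walking the same path.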


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
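
Why zeroing first matters can be shown with a small structure that has
padding between its members. The structure and function below are
hypothetical stand-ins for struct stat and the code that fills it; the
layout with padding after the first member is an assumption about
typical ABIs.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical record: on typical ABIs there are 7 bytes of padding
 * between kind and dev, which member assignments never write.
 */
struct stat_sketch {
    uint8_t kind;
    uint64_t dev;
};

/*
 * Fill the record for copying out. Clearing the whole object first
 * makes sure the padding holds zeros rather than stale memory.
 */
void
fill_stat_sketch(struct stat_sketch *sb, uint8_t kind, uint64_t dev)
{
    memset(sb, 0, sizeof(*sb));
    sb->kind = kind;
    sb->dev = dev;
}
```

Without the memset(), whatever bytes happened to be in the padding would
be copied to the caller along with the real fields.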


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
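
The four strict levels listed above can be read as a threshold table: each kind of event is denied once the strict level reaches a minimum value, and below that it only produces a warning. The sketch below encodes that reading. All identifiers are hypothetical and chosen for illustration; they are not the names used by the Veriexec code.

```c
#include <assert.h>

/* Hypothetical names for strict levels 0-3 as described in the log entry. */
enum demo_strict_level {
	DEMO_STRICT_LEARNING = 0,	/* warnings only */
	DEMO_STRICT_IDS = 1,
	DEMO_STRICT_IPS = 2,
	DEMO_STRICT_LOCKDOWN = 3
};

/* Hypothetical event kinds, one per policy bullet in the log entry. */
enum demo_event {
	DEMO_EV_FP_MISMATCH,		/* fingerprint mismatch on access */
	DEMO_EV_TABLE_MODIFY,		/* modification of veriexec tables */
	DEMO_EV_WRITE_MONITORED,	/* write access to a monitored file */
	DEMO_EV_REMOVE_MONITORED,	/* removal of a monitored file */
	DEMO_EV_ACCESS_TYPE,		/* wrong access type for the entry */
	DEMO_EV_CREATE_FILE,		/* creation of a new file */
	DEMO_EV_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Lowest strict level at which the event is denied rather than warned. */
static int
demo_min_deny_level(enum demo_event ev)
{
	switch (ev) {
	case DEMO_EV_FP_MISMATCH:
	case DEMO_EV_TABLE_MODIFY:
		return DEMO_STRICT_IDS;
	case DEMO_EV_WRITE_MONITORED:
	case DEMO_EV_REMOVE_MONITORED:
	case DEMO_EV_ACCESS_TYPE:
		return DEMO_STRICT_IPS;
	case DEMO_EV_CREATE_FILE:
	case DEMO_EV_ACCESS_UNMONITORED:
		return DEMO_STRICT_LOCKDOWN;
	}
	return DEMO_STRICT_LOCKDOWN;	/* unknown events: deny only in lockdown */
}

/* Return 1 if the event is denied at the given strict level, else 0. */
static int
demo_denied(int level, enum demo_event ev)
{
	assert(level >= DEMO_STRICT_LEARNING && level <= DEMO_STRICT_LOCKDOWN);
	return level >= demo_min_deny_level(ev);
}
```

Because each level implies everything below it, a single comparison against the threshold is enough to model the policy.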


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
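
A rough sketch of how a distinct pass-through code keeps layered ioctl handling unambiguous. The numeric values -3 and -4 come from the log entry above; the handler names, the command numbers, and the final mapping of an unhandled command to ENOTTY are assumptions made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values taken from the log entry; the DEMO_ names are hypothetical. */
#define DEMO_ERESTART		(-3)	/* restart the system call */
#define DEMO_EPASSTHROUGH	(-4)	/* command not handled at this layer */

typedef int (*demo_ioctl_fn)(unsigned long cmd, void *data);

/* Hypothetical layer that only understands command 1. */
static int
demo_tty_ioctl(unsigned long cmd, void *data)
{
	if (cmd != 1)
		return DEMO_EPASSTHROUGH;
	*(int *)data = 80;
	return 0;
}

/* Hypothetical layer that only understands command 2. */
static int
demo_disk_ioctl(unsigned long cmd, void *data)
{
	if (cmd != 2)
		return DEMO_EPASSTHROUGH;
	*(int *)data = 512;
	return 0;
}

/*
 * Try each layer in turn. Only DEMO_EPASSTHROUGH moves on to the next
 * layer; any other result, including DEMO_ERESTART, is returned as is.
 */
static int
demo_ioctl(unsigned long cmd, void *data)
{
	static const demo_ioctl_fn layers[] = { demo_tty_ioctl, demo_disk_ioctl };
	size_t i;
	int error;

	assert(data != NULL);
	for (i = 0; i < sizeof(layers) / sizeof(layers[0]); i++) {
		error = layers[i](cmd, data);
		if (error != DEMO_EPASSTHROUGH)
			return error;
	}
	return ENOTTY;	/* assumed user-visible result when no layer handles cmd */
}
```

With -1 serving as both "unhandled" and ERESTART, the dispatcher could not tell the two cases apart, which is the confusion the log entry describes.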


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
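
The two rules in this message (the 95% cap on the sum of the minimums, and the refusal to reuse a page when that would push its type below its minimum) can be sketched as follows. This is not the UVM code: the enum, the array, the function names and the initial percentages are all hypothetical.

```c
#include <assert.h>
#include <errno.h>

enum usage_sketch { USAGE_ANON, USAGE_FILE, USAGE_EXEC, USAGE_COUNT };

/* Hypothetical tunables, each a percentage of pageable memory. */
static int usage_min_sketch[USAGE_COUNT] = { 10, 10, 5 };

/* Setter: reject a value that would make the minimums sum to more than 95. */
static int
set_usage_min_sketch(enum usage_sketch which, int percent)
{
	int i, sum = percent;

	if (percent < 0)
		return EINVAL;
	for (i = 0; i < USAGE_COUNT; i++)
		if (i != (int)which)
			sum += usage_min_sketch[i];
	if (sum > 95)
		return EINVAL;
	usage_min_sketch[which] = percent;
	return 0;
}

/* Pagedaemon-side check: may one page of this type be reused? */
static int
may_reuse_sketch(enum usage_sketch which, long pages_of_type,
    long pageable_pages)
{
	assert(pageable_pages > 0);
	/* Taking one page away must not drop the type below its share. */
	return (pages_of_type - 1) * 100 >=
	    (long)usage_min_sketch[which] * pageable_pages;
}
```

With the hypothetical defaults above, a file-cache population of 100 pages out of 1000 pageable pages sits exactly at its 10% minimum, so `may_reuse_sketch(USAGE_FILE, 100, 1000)` refuses the reuse.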


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
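
A rough model of this calling convention, assuming an in-memory "file": the read routine takes a pointer to the offset it should use plus a flags word, and advances that offset only when the update flag is set. Every name here (`file_sketch`, `fo_read_sketch`, `FOF_UPDATE_OFFSET_SKETCH`, the two wrappers) is hypothetical, and the read-style versus positional-read-style split is just one way such an interface could be used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FOF_UPDATE_OFFSET_SKETCH	0x01	/* hypothetical flag */

struct file_sketch {
	const char	*data;		/* backing bytes */
	size_t		 size;		/* length of data */
	size_t		 offset;	/* the file's own offset */
};

/* Read from *offset; advance *offset only if the caller asked for it. */
static size_t
fo_read_sketch(struct file_sketch *fp, size_t *offset, char *buf,
    size_t len, int flags)
{
	size_t n = 0;

	if (*offset < fp->size) {
		n = fp->size - *offset;
		if (n > len)
			n = len;
		memcpy(buf, fp->data + *offset, n);
	}
	if (flags & FOF_UPDATE_OFFSET_SKETCH)
		*offset += n;
	return n;
}

/* Sequential read: uses and updates the file's own offset. */
static size_t
read_sketch(struct file_sketch *fp, char *buf, size_t len)
{
	return fo_read_sketch(fp, &fp->offset, buf, len,
	    FOF_UPDATE_OFFSET_SKETCH);
}

/* Positional read: uses a caller-supplied offset, leaves the file's alone. */
static size_t
pread_sketch(struct file_sketch *fp, char *buf, size_t len, size_t off)
{
	return fo_read_sketch(fp, &off, buf, len, 0);
}

/* Returns the file offset after one positional and one sequential read. */
static size_t
offset_after_demo(void)
{
	struct file_sketch f = { "hello world", 11, 0 };
	char buf[8];
	size_t n;

	n = pread_sketch(&f, buf, 5, 6);	/* reads "world" */
	assert(n == 5 && f.offset == 0);
	n = read_sketch(&f, buf, 5);		/* reads "hello" */
	assert(n == 5);
	return f.offset;
}
```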


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.204 16-Dec-2019 ad

- Extend the per-CPU counters matt@ did to include all of the hot counters
in UVM, excluding uvmexp.free, which needs special treatment and will be
done with a separate commit. Cuts system time for a build by 20-25% on
a 48 CPU machine w/DIAGNOSTIC.

- Avoid 64-bit integer divide on every fault (for rnd_add_uint32).


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
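
The truncation hazard motivating the rename can be reproduced in a few lines. The sketch below is not the libkern code: `uimin_sketch`, `MIN_SKETCH` and the two clamp helpers are stand-ins, and the example values assume a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper defined on unsigned int, as the commit describes. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Macro form: no truncation, but each argument may be evaluated twice. */
#define MIN_SKETCH(a, b)	((a) < (b) ? (a) : (b))

static uint64_t
clamp_truncating(uint64_t len, uint64_t limit)
{
	/* The casts spell out the conversion that otherwise happens silently. */
	return uimin_sketch((unsigned int)len, (unsigned int)limit);
}

static uint64_t
clamp_exact(uint64_t len, uint64_t limit)
{
	return MIN_SKETCH(len, limit);
}

/* Returns 1 when the two forms disagree for values above 32 bits. */
static int
truncation_differs(void)
{
	const uint64_t len = 0x100000001ULL, limit = 0x200000000ULL;

	assert(sizeof(unsigned int) == 4);	/* assumption of this sketch */
	return clamp_truncating(len, limit) != clamp_exact(len, limit);
}
```

Here the function form sees the low 32 bits only (1 and 0) and returns 0, while the macro form returns the true minimum, 0x100000001.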


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
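
The leak described here comes from padding bytes that are never written when a structure is filled field by field. A minimal sketch of the fix, using a hypothetical two-field structure rather than the real struct stat, and relying on the usual compiler behaviour that storing to a member leaves neighbouring padding untouched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 32-bit field followed by a 64-bit one usually
 * leaves padding bytes between them. */
struct stat_sketch {
	uint32_t	mode;
	uint64_t	dev;
};

static void
fill_stat_sketch(struct stat_sketch *sb, uint32_t mode, uint64_t dev)
{
	/* Clear everything first, padding included, then set the fields. */
	memset(sb, 0, sizeof(*sb));
	sb->mode = mode;
	sb->dev = dev;
}

/* Returns 1 if every byte between the two fields is zero after filling. */
static int
padding_is_clear(void)
{
	struct stat_sketch sb;
	const unsigned char *p = (const unsigned char *)&sb;
	size_t i;

	memset(&sb, 0xa5, sizeof(sb));	/* simulate stale stack contents */
	fill_stat_sketch(&sb, 0644, 42);
	assert(sb.mode == 0644 && sb.dev == 42);
	for (i = sizeof(sb.mode); i < offsetof(struct stat_sketch, dev); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Without the memset in fill_stat_sketch(), the bytes between the two fields would still hold the stale 0xa5 pattern when the structure is copied out.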


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has creeped
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root; if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
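
The two-step lookup described in this entry (emulation root first, then the real root) can be sketched in userland C. This is a simplified model under stated assumptions, not the namei() implementation: the path table, sk_exists(), sk_lookup() and SK_TRYEMULROOT are hypothetical stand-ins, and symlink and ".." handling are left out.

```c
#include <stdio.h>
#include <string.h>

#define SK_TRYEMULROOT	0x1	/* hypothetical stand-in for the real flag */

/* Hypothetical in-memory "file system": the only paths that exist. */
static const char *const sk_paths[] = {
	"/emul/linux/lib/libc.so",
	"/lib/libc.so",
	"/etc/passwd",
};

static int
sk_exists(const char *path)
{
	size_t i;

	for (i = 0; i < sizeof(sk_paths) / sizeof(sk_paths[0]); i++)
		if (strcmp(sk_paths[i], path) == 0)
			return 1;
	return 0;
}

/*
 * Resolve an absolute path. When an emulation root is set and the
 * caller asks for it, search there first; on failure retry from the
 * real root. Returns 0 and fills "out" on success, -1 otherwise.
 */
int
sk_lookup(const char *path, const char *emul_root, int flags,
    char *out, size_t outlen)
{
	int n;

	if (emul_root != NULL && (flags & SK_TRYEMULROOT) && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n > 0 && (size_t)n < outlen && sk_exists(out))
			return 0;
	}
	n = snprintf(out, outlen, "%s", path);
	if (n > 0 && (size_t)n < outlen && sk_exists(out))
		return 0;
	return -1;
}
```

A path present under the emulation root resolves there; a path that exists only under the real root is still found by the retry.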


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this into two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
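
The four strict levels listed in this entry are cumulative: each level keeps the restrictions of the one below it. A minimal sketch of that ordering follows, assuming hypothetical names (the sk_ identifiers are illustrative and are not the Veriexec API); only three of the listed policies are modelled.

```c
#include <stdbool.h>

/* Hypothetical names; the numbering follows the commit message. */
enum sk_strict_level {
	SK_LEVEL_LEARNING = 0,	/* warnings only */
	SK_LEVEL_IDS = 1,	/* deny on fingerprint mismatch */
	SK_LEVEL_IPS = 2,	/* also protect monitored files */
	SK_LEVEL_LOCKDOWN = 3	/* also deny non-monitored files */
};

/* Level 1 and above: deny access on fingerprint mismatch. */
bool
sk_deny_on_mismatch(enum sk_strict_level lvl)
{
	return lvl >= SK_LEVEL_IDS;
}

/* Level 2 and above: deny write access to monitored files. */
bool
sk_deny_write_monitored(enum sk_strict_level lvl)
{
	return lvl >= SK_LEVEL_IPS;
}

/* Level 3 only: deny access to non-monitored files. */
bool
sk_deny_unmonitored(enum sk_strict_level lvl)
{
	return lvl >= SK_LEVEL_LOCKDOWN;
}
```

Expressing each policy as a "level >= threshold" test keeps the "all implications of the lower level" rule implicit in the comparison.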


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
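
The convention in this entry can be modelled as a handler that returns a dedicated "not mine" code for commands it does not recognise, which the caller then maps to a user-visible error. This is a rough sketch: the numeric values are the ones stated in the entry, but the sk_ names, the sample command, and the final translation to ENOTTY at the top level are assumptions of the sketch.

```c
#include <errno.h>

#define SK_ERESTART	(-3)	/* value stated in the commit message */
#define SK_EPASSTHROUGH	(-4)	/* value stated in the commit message */

#define SK_CMD_GETVAL	1UL	/* hypothetical command number */

/* Hypothetical driver handler: serves one command, passes the rest on. */
int
sk_driver_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case SK_CMD_GETVAL:
		*data = 42;
		return 0;
	default:
		return SK_EPASSTHROUGH;	/* unhandled; distinct from ERESTART */
	}
}

/* Hypothetical top level: an unhandled command becomes ENOTTY. */
int
sk_ioctl(unsigned long cmd, int *data)
{
	int error = sk_driver_ioctl(cmd, data);

	if (error == SK_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through code no longer shares a value with the restart code, "restart the system call" and "command not handled" cannot be confused.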


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
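
The constraint on the three minimums can be written as a small validation check. This is a sketch under stated assumptions: the function name and the rejection of negative values are illustrative, and only the 95% limit comes from the entry above.

```c
#include <stdbool.h>

#define SK_MIN_SUM_LIMIT	95	/* percent; limit stated in the entry */

/*
 * Accept a set of per-type minimum percentages only if their sum
 * leaves some memory that can always be reused.
 */
bool
sk_mins_ok(int anonmin, int vnodemin, int vtextmin)
{
	if (anonmin < 0 || vnodemin < 0 || vtextmin < 0)
		return false;
	return anonmin + vnodemin + vtextmin <= SK_MIN_SUM_LIMIT;
}
```

For example, minimums of 40, 40 and 15 percent are accepted (sum 95), while 50, 30 and 20 percent are rejected (sum 100).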


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.203 01-Dec-2019 ad

Minor vnode locking changes:

- Stop using atomics to manipulate v_usecount. It was a mistake to begin
with. It doesn't work as intended unless the XLOCK bit is incorporated in
v_usecount and we don't have that any more. When I introduced this 10+
years ago it was to reduce pressure on v_interlock but it doesn't do that,
it just makes stuff disappear from lockstat output and introduces problems
elsewhere. We could do atomic usecounts on vnodes but there has to be a
well thought out scheme.

- Resurrect LK_UPGRADE/LK_DOWNGRADE which will be needed to work effectively
when there is increased use of shared locks on vnodes.

- Allocate the vnode lock using rw_obj_alloc() to reduce false sharing of
struct vnode.

- Put all of the LRU lists into a single cache line, and do not requeue a
vnode if it's already on the correct list and was requeued recently (less
than a second ago).

Kernel build before and after:

119.63s real 1453.16s user 2742.57s system
115.29s real 1401.52s user 2690.94s system


Revision tags: phil-wifi-20191119
# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
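
The truncation hazard this entry describes can be illustrated with a small sketch. The names uimin_sketch, MIN_SKETCH and min_via_uint are hypothetical stand-ins, not the libkern definitions, and the example assumes a platform where unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* Stand-in for a helper defined on unsigned int, as uimin is described above. */
static unsigned int
uimin_sketch(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Type-preserving macro in the style quoted in the commit message. */
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/*
 * Takes the minimum of two 64-bit values through the unsigned int helper.
 * The arguments are implicitly converted, so any bits above the width of
 * unsigned int are silently dropped before the comparison.
 */
static uint64_t
min_via_uint(uint64_t a, uint64_t b)
{
	return uimin_sketch(a, b);
}
```

With 32-bit unsigned int, min_via_uint(0x100000001, 5) compares 1 against 5 and returns 1, whereas MIN_SKETCH on the same 64-bit operands returns 5. The uimin name makes that narrowing visible at the call site.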


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
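
A rough userland sketch of the difference, under stated assumptions: open_regular_only() is a hypothetical helper, and the #else branch is the open-then-fstat pattern the commit message argues against, included only so the sketch builds on systems that do not define O_REGULAR.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns a read-only descriptor for a regular file, or -1 otherwise. */
static int
open_regular_only(const char *path)
{
#ifdef O_REGULAR
	/* The open itself is expected to fail for anything but a regular file. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	/*
	 * Fallback: the file is opened before its type is known, so a
	 * device node has already been opened by the time fstat() runs.
	 */
	struct stat st;
	int fd = open(path, O_RDONLY | O_NONBLOCK);

	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode)) {
		close(fd);
		return -1;
	}
	return fd;
#endif
}
```

Either branch rejects a directory such as ".", but only the O_REGULAR branch avoids opening a device node in the first place.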


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
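
The ordering stated above can be sketched with toy locks. Everything here (toy_lock, toy_node, step_to_child) is hypothetical and single-threaded; it only shows the order of operations, not the kernel's vnode locking.

```c
#include <assert.h>

/* Toy lock that only records whether it is held. */
struct toy_lock { int held; };

static void
toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void
toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct toy_node {
	struct toy_lock lock;
	struct toy_node *child;
};

/*
 * Step from parent to child in the stated order:
 * lock the parent, lock the child, then unlock the parent.
 * The child is returned locked.
 */
static struct toy_node *
step_to_child(struct toy_node *parent)
{
	struct toy_node *child;

	toy_lock_acquire(&parent->lock);
	child = parent->child;
	toy_lock_acquire(&child->lock);
	toy_lock_release(&parent->lock);
	return child;
}
```

After step_to_child() returns, the parent is unlocked and the child is held, so at no point is the pair left without a lock.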


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
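
A minimal sketch of the pattern this entry describes, using a hypothetical toy_stat record rather than the real struct stat. The one-byte field followed by a 64-bit field is assumed to leave compiler-inserted padding between them, which is the kind of gap that can carry stale memory if the structure is filled field by field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; padding is expected between kind and dev. */
struct toy_stat {
	uint8_t  kind;
	uint64_t dev;
};

/* Zero the whole record first, padding included, then fill the fields. */
static void
fill_toy_stat(struct toy_stat *sb, uint8_t kind, uint64_t dev)
{
	memset(sb, 0, sizeof(*sb));
	sb->kind = kind;
	sb->dev = dev;
}

/* Returns 1 if every byte between kind and dev is zero, 0 otherwise. */
static int
padding_is_zero(const struct toy_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	size_t i;

	for (i = sizeof(sb->kind); i < offsetof(struct toy_stat, dev); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Without the memset(), the padding bytes would keep whatever the buffer held before, and copying the whole structure out would expose them.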


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers that lacked it and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
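
The behaviour can be checked from userland with a sketch along these lines. The helper name offset_after_empty_append() and the scratch path under /tmp are assumptions made for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Creates a scratch file opened with O_APPEND, writes five bytes, seeks
 * back to offset 0, then issues a zero-length write. Returns the file
 * offset observed afterwards, or -1 if any step fails.
 */
static off_t
offset_after_empty_append(void)
{
	char path[64];
	off_t off = -1;
	int fd;

	snprintf(path, sizeof(path), "/tmp/append-sketch.%ld", (long)getpid());
	fd = open(path, O_RDWR | O_CREAT | O_EXCL | O_APPEND, 0600);
	if (fd == -1)
		return -1;
	unlink(path);		/* the descriptor keeps the file alive */

	if (write(fd, "hello", 5) == 5 &&
	    lseek(fd, 0, SEEK_SET) == 0 &&
	    write(fd, "", 0) == 0)
		off = lseek(fd, 0, SEEK_CUR);

	close(fd);
	return off;
}
```

On a kernel with the behaviour this entry describes, the offset should still be 0 after the empty write, not moved to the end of the file.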


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
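
The four strict levels above are cumulative: each level implies everything the lower ones enforce. A minimal sketch of that policy, assuming hypothetical names (the `sketch_` identifiers are illustrative only and are not the kernel's actual symbols):

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels described above. */
enum sketch_strict_level {
	SKETCH_LEARNING = 0,	/* warnings only */
	SKETCH_IDS = 1,		/* deny on fingerprint mismatch */
	SKETCH_IPS = 2,		/* also protect monitored files */
	SKETCH_LOCKDOWN = 3	/* also deny creation and non-monitored access */
};

/* Level 1 and above: deny access when the fingerprint does not match. */
static bool
sketch_deny_on_mismatch(enum sketch_strict_level level)
{
	return level >= SKETCH_IDS;
}

/* Level 2 and above: deny write access to monitored files. */
static bool
sketch_deny_write_monitored(enum sketch_strict_level level)
{
	return level >= SKETCH_IPS;
}

/* Level 3 only: prevent creation of new files. */
static bool
sketch_deny_create(enum sketch_strict_level level)
{
	return level >= SKETCH_LOCKDOWN;
}

/* Level 3 only: deny access to files that are not monitored. */
static bool
sketch_deny_unmonitored(enum sketch_strict_level level)
{
	return level >= SKETCH_LOCKDOWN;
}
```

Expressing each check as a threshold comparison is one way to keep the "all implications of the previous level" rule true by construction.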


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
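
The convention described above can be sketched as follows. This is an illustrative user-space model, not the kernel code: the `sketch_` functions and the command number are hypothetical, the two negative values are the ones stated in the commit message, and the translation of the pass-through value to ENOTTY at the top level is an assumption about how an unhandled command is finally reported.

```c
#include <errno.h>

/* Values as stated in the commit message above. */
#define SKETCH_ERESTART		(-3)
#define SKETCH_EPASSTHROUGH	(-4)

/* Hypothetical lower layer: handles command 1, passes everything else up. */
static int
sketch_driver_ioctl(unsigned long cmd)
{
	switch (cmd) {
	case 1:
		return 0;			/* handled successfully */
	default:
		return SKETCH_EPASSTHROUGH;	/* "not mine", distinct from ERESTART */
	}
}

/* Hypothetical top level: only here does "unhandled" become a user-visible errno. */
static int
sketch_sys_ioctl(unsigned long cmd)
{
	int error = sketch_driver_ioctl(cmd);

	if (error == SKETCH_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through value no longer shares -1 with ERESTART, a layer can tell "command not handled here" apart from "restart the system call".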


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
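
The two rules in this message (the minimums may sum to at most 95%, and a page is not reused if that would push its type below its minimum) can be sketched as below. This is a simplified model under stated assumptions: the `sketch_` function names and the exact arithmetic are hypothetical and are not taken from the pagedaemon code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound on the sum of the three minimum percentages, per the text above. */
#define SKETCH_MAX_MIN_SUM	95

/* Would this combination of vm.anonmin, vm.vnodemin and vm.vtextmin be accepted? */
static bool
sketch_mins_valid(unsigned anonmin, unsigned vnodemin, unsigned vtextmin)
{
	return anonmin + vnodemin + vtextmin <= SKETCH_MAX_MIN_SUM;
}

/*
 * May one page of a given type be reused?  Allowed only if the count
 * that remains is still at least min_pct percent of pageable memory.
 */
static bool
sketch_may_reuse(uint64_t in_use, uint64_t pageable, unsigned min_pct)
{
	if (in_use == 0)
		return false;
	return (in_use - 1) * 100 >= pageable * min_pct;
}
```

For example, with 100 pageable pages and a 10% minimum, a type holding 11 pages may give one up, while a type holding exactly 10 may not.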


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
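
A rough illustration of the calling convention described above: the read routine receives a pointer to the offset plus a flags word, and only advances the offset when the caller asks for it. Everything here is a hypothetical user-space model over an in-memory buffer; `sketch_file`, `sketch_read`, `SKETCH_UPDATE_OFFSET` and the demo helper are illustrative names, not the kernel's fileops interface.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: update the offset passed by reference after the transfer. */
#define SKETCH_UPDATE_OFFSET	0x01

/* Hypothetical file object backed by a constant in-memory buffer. */
struct sketch_file {
	const char *data;
	size_t size;
};

/* Read up to len bytes starting at *offset; returns the number of bytes copied. */
static size_t
sketch_read(const struct sketch_file *fp, off_t *offset, void *buf,
    size_t len, int flags)
{
	size_t avail, n;

	if (*offset < 0 || (size_t)*offset >= fp->size)
		return 0;
	avail = fp->size - (size_t)*offset;
	n = len < avail ? len : avail;
	memcpy(buf, fp->data + *offset, n);
	if (flags & SKETCH_UPDATE_OFFSET)
		*offset += (off_t)n;
	return n;
}

/* Demo helper: read 4 bytes at offset 2 of "abcdefgh", return the resulting offset. */
static off_t
sketch_demo_offset_after(int flags)
{
	static const struct sketch_file f = { "abcdefgh", 8 };
	char buf[4];
	off_t off = 2;

	(void)sketch_read(&f, &off, buf, sizeof(buf), flags);
	return off;
}
```

With the flag set the call behaves like a sequential read that moves the position; without it the same routine can serve a positioned read that leaves the caller's offset alone.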


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.202 10-Nov-2019 mlelstv

Add functions to open devices by device number or path.


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
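
The truncation hazard that motivates the rename can be reproduced in a few lines. This is a sketch under stated assumptions: `sketch_uimin` is a hypothetical stand-in for a helper defined on unsigned int, `SKETCH_MIN` stands in for a MIN-style macro, and the example assumes unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical stand-in for a min helper defined on unsigned int only. */
static inline unsigned int
sketch_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* MIN-style macro: no truncation, but each argument may be evaluated twice. */
#define SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Feeding 64-bit values through the unsigned int helper.  The casts are
 * written out here; with a generically named min() the same narrowing
 * happens implicitly and silently at the call site.
 */
static inline uint64_t
sketch_min_via_uimin(uint64_t a, uint64_t b)
{
	return sketch_uimin((unsigned int)a, (unsigned int)b);
}
```

With a = 0x100000001 and b = 5, the helper sees a as 1 after narrowing and returns 1, while the macro compares the full 64-bit values and returns 5.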


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
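
For comparison, the older open-then-check idiom mentioned above might look like the sketch below, with the single-call form used where the flag is available. `open_regular_sketch` is a hypothetical helper name, and the EINVAL returned by the fallback path is an arbitrary choice for this example rather than the error the kernel reports.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path read-only, but only if it names a regular file.
 * Returns a file descriptor, or -1 with errno set.
 */
static int
open_regular_sketch(const char *path)
{
#ifdef O_REGULAR
	/* The kernel rejects non-regular files before any device open routine runs. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	struct stat st;
	int fd, saved;

	/* Fallback: the device (if any) has already been opened by this point. */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;
		return -1;
	}
	return fd;
#endif
}
```

The fallback branch shows the weakness the commit message points at: by the time fstat() can reject the file, the open has already happened.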


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
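
The pattern described above can be sketched in userland as follows. This is only an analogue of the kernel change, not the actual vn_stat() code: struct record, fill_record() and record_padding_is_zero() are hypothetical names, and the layout assumes a common ABI where a 64-bit member following a 16-bit member forces padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical record with the same layout problem: on common ABIs the
 * compiler inserts padding between `mode` and the 64-bit `dev`.
 */
struct record {
	uint16_t mode;
	uint64_t dev;
};

/* Fill *out so that every byte, padding included, has a defined value. */
void
fill_record(struct record *out, uint16_t mode, uint64_t dev)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));	/* clears padding as well as members */
	out->mode = mode;
	out->dev = dev;
}

/* Return 1 if every byte between `mode` and `dev` is zero, else 0. */
int
record_padding_is_zero(const struct record *r)
{
	const unsigned char *p = (const unsigned char *)r;
	size_t i;

	for (i = sizeof(r->mode); i < offsetof(struct record, dev); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Without the memset(), whatever was previously in the buffer would survive in the padding bytes and be copied out along with the members.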


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
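
As a rough illustration of the user-visible effect only (not of the v_size vs. va_bytes details mentioned above), the following sketch lowers the soft RLIMIT_FSIZE and attempts a one-byte write at the limit. errno_of_write_at_limit() is a hypothetical helper, and the expectation of EFBIG follows the POSIX description of write(), assuming SIGXFSZ is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Create `path`, set the soft RLIMIT_FSIZE to `limit`, and try to write
 * one byte at offset `limit`.  Returns the errno of that write, 0 if the
 * write succeeded, or -1 if the setup failed.  The previous limit and
 * SIGXFSZ disposition are restored before returning.
 * (Hypothetical helper for illustration only.)
 */
int
errno_of_write_at_limit(const char *path, rlim_t limit)
{
	struct rlimit saved, lim;
	void (*old_handler)(int);
	int fd, err = 0;

	assert(path != NULL);
	if (getrlimit(RLIMIT_FSIZE, &saved) == -1)
		return -1;
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return -1;

	/* Ignore SIGXFSZ so the failing write reports an error instead. */
	old_handler = signal(SIGXFSZ, SIG_IGN);
	if (old_handler == SIG_ERR) {
		close(fd);
		return -1;
	}

	lim = saved;
	lim.rlim_cur = limit;
	if (setrlimit(RLIMIT_FSIZE, &lim) == 0) {
		if (pwrite(fd, "x", 1, (off_t)limit) == -1)
			err = errno;
		setrlimit(RLIMIT_FSIZE, &saved);
	} else {
		err = -1;
	}

	signal(SIGXFSZ, old_handler);
	close(fd);
	return err;
}
```

Because the check happens before the file system's write routine is reached, the result should not depend on which file system `path` lives on.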


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
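
A small userland sketch of the behaviour this entry describes, assuming only POSIX open(), write() and lseek(). offset_after_empty_append() is a hypothetical helper name; it moves the offset away from end-of-file and then issues a zero-length write on an O_APPEND descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create `path` with O_APPEND, write five bytes, rewind to offset 0,
 * then issue a zero-length write.  Returns the descriptor's offset
 * after the empty write, or -1 on error.
 * (Hypothetical helper for illustration only.)
 */
off_t
offset_after_empty_append(const char *path)
{
	off_t off = -1;
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;

	if (write(fd, "hello", 5) == 5 &&	/* file is now 5 bytes long */
	    lseek(fd, 0, SEEK_SET) == 0 &&	/* move the offset away from EOF */
	    write(fd, "", 0) == 0)		/* the empty write under test */
		off = lseek(fd, 0, SEEK_CUR);

	close(fd);
	return off;
}
```

If the empty write had been treated like a normal append, the offset would have been moved to the end of the file instead of staying where lseek() left it.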


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() with veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - not always the data will be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved by thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
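
The pass-through convention described above can be sketched in userland roughly as follows. The numeric values come from the commit message; the DEMO_ names, the two handler layers and the command numbers are hypothetical, and the final translation to ENOTTY at the top level is an assumption about how an unhandled command is reported to the caller.

```c
#include <errno.h>

/* Kernel-internal codes, with the values given in the commit message. */
#define DEMO_ERESTART		(-3)
#define DEMO_EPASSTHROUGH	(-4)

/* Hypothetical commands, one understood by each layer. */
#define DEMO_CMD_RESET		1
#define DEMO_CMD_GETVAL		2

/* Upper layer: handles what it knows, passes everything else through. */
static int
upper_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case DEMO_CMD_RESET:
		*data = 0;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;
	}
}

/* Lower layer: same convention. */
static int
lower_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case DEMO_CMD_GETVAL:
		*data = 42;
		return 0;
	default:
		return DEMO_EPASSTHROUGH;
	}
}

/* Top level: try each layer, then turn "nobody handled it" into ENOTTY. */
static int
demo_ioctl(int cmd, int *data)
{
	int error = upper_ioctl(cmd, data);

	if (error == DEMO_EPASSTHROUGH)
		error = lower_ioctl(cmd, data);
	if (error == DEMO_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through code is distinct from DEMO_ERESTART, a handler that really wants the call restarted can no longer be confused with one that simply did not recognise the command.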


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
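
A simplified userland model of the interface change described above: the read routine receives a pointer to the offset to use plus a flags word, and only advances that offset when asked to. All names here (demo_file, demo_read, DEMO_FOF_UPDATE_OFFSET) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag: advance the offset passed by reference. */
#define DEMO_FOF_UPDATE_OFFSET	0x01

/* Hypothetical in-memory "file" with its own current offset. */
struct demo_file {
	const char	*data;
	size_t		 size;
	off_t		 f_offset;
};

/*
 * Read up to len bytes starting at *offset. The offset is updated only
 * when DEMO_FOF_UPDATE_OFFSET is set in flags.
 */
static ssize_t
demo_read(struct demo_file *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	size_t avail;

	if (*offset < 0 || (size_t)*offset >= fp->size)
		return 0;
	avail = fp->size - (size_t)*offset;
	if (len > avail)
		len = avail;
	memcpy(buf, fp->data + *offset, len);
	if (flags & DEMO_FOF_UPDATE_OFFSET)
		*offset += (off_t)len;
	return (ssize_t)len;
}
```

An ordinary sequential read would pass &fp->f_offset together with the update flag, while a positioned read could pass a private offset and no flag, leaving the file's own offset untouched.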


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.201 15-Sep-2019 christos

set VEXEC if FEXEC is set.


Revision tags: netbsd-9-base phil-wifi-20190609 isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
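
The truncation hazard described above can be reproduced in userland with a small sketch. The uimin() below is a stand-in written for this example, not the libkern source, and DEMO_MIN is a hypothetical name for the type-generic macro form; the demonstration assumes a platform where unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* Stand-in for a min function defined on unsigned int only. */
static inline unsigned int
uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Type-generic macro form: may evaluate its arguments more than once,
 * but never narrows them.
 */
#define DEMO_MIN(a, b)	((a) < (b) ? (a) : (b))

/* The 64-bit arguments are silently converted to unsigned int here. */
static uint64_t
clamp_with_function(uint64_t len, uint64_t limit)
{
	return uimin(len, limit);
}

/* No narrowing: the comparison is done on the full 64-bit values. */
static uint64_t
clamp_with_macro(uint64_t len, uint64_t limit)
{
	return DEMO_MIN(len, limit);
}
```

With len = 0x100000001 and limit = 5, the function form compares the truncated value 1 against 5 and yields 1, whereas the macro form yields 5. The uimin name at least makes the operand width visible at the call site.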


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2; 1.197.4;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
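
A hedged userland sketch of the difference: where O_REGULAR is available the kernel refuses non-regular files at open time, while the fallback has to open first and check afterwards, by which point a device open may already have had its side effects. The helper name open_regular_only and the choice of EINVAL in the fallback path are assumptions made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path read-only, but only if it names a
 * regular file. Returns a descriptor, or -1 with errno set.
 */
static int
open_regular_only(const char *path)
{
#ifdef O_REGULAR
	/* The kernel rejects non-regular files before opening them. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	/* Fallback: open, then check. Too late to avoid side effects. */
	struct stat st;
	int fd = open(path, O_RDONLY | O_NONBLOCK);

	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* arbitrary choice for this sketch */
		return -1;
	}
	return fd;
#endif
}
```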


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
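
The pattern behind this fix can be sketched in userland as follows: when a structure has padding between members, filling in the members one by one leaves the padding bytes holding whatever was in memory before, so the whole object is zeroed first. The structure and function names are hypothetical; whether padding actually exists between the two members depends on the ABI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical record: on most 64-bit ABIs there are padding bytes
 * between the 32-bit and the 64-bit member.
 */
struct demo_rec {
	uint32_t	mode;
	uint64_t	dev;
};

/* Zero the whole object, padding included, before setting members. */
static void
fill_rec(struct demo_rec *out, uint32_t mode, uint64_t dev)
{
	memset(out, 0, sizeof(*out));
	out->mode = mode;
	out->dev = dev;
}

/* Return 1 if every byte between mode and dev is zero, else 0. */
static int
padding_is_zero(const struct demo_rec *rec)
{
	unsigned char raw[sizeof(*rec)];
	size_t i;

	memcpy(raw, rec, sizeof(raw));
	for (i = sizeof(uint32_t); i < offsetof(struct demo_rec, dev); i++)
		if (raw[i] != 0)
			return 0;
	return 1;
}
```

Without the memset, copying such a structure out to another protection domain would also copy the stale padding bytes.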


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.
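
Why the two flags need distinct bits can be sketched in isolation. The bit
values and the helper name below are hypothetical, chosen only for
illustration; they are not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit assignments, for illustration only. */
#define EX_TRYEMULROOT 0x01u
#define EX_NOCHROOT    0x02u	/* must not share a bit with EX_TRYEMULROOT */

/*
 * Mirrors the shape described in the commit: the NOCHROOT handling sits
 * in the else branch of the TRYEMULROOT test, so it can only run when
 * the TRYEMULROOT bit is clear.
 */
static bool
ex_nochroot_path_taken(unsigned flags)
{
	if (flags & EX_TRYEMULROOT) {
		return false;	/* emulation-root lookup; NOCHROOT never examined */
	} else {
		return (flags & EX_NOCHROOT) != 0;
	}
}
```

If both names were assigned the same bit, every caller passing NOCHROOT
would take the first branch and the else branch would be unreachable for it.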


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
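
The behaviour this commit describes can be checked from userland. The
following is a minimal sketch, not the kernel change itself; the helper
name and the temporary file template are made up for the example, and the
result depends on the kernel it runs on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a zero-length write on an O_APPEND descriptor leaves the
 * file offset unchanged, 0 if the offset moved, -1 on setup failure.
 */
static int
ex_zero_write_keeps_offset(void)
{
	char path[] = "/tmp/append_zero_XXXXXX";	/* hypothetical template */
	int fd = mkstemp(path);

	if (fd == -1)
		return -1;
	unlink(path);

	/* Put data in the file so that end-of-file differs from offset 0. */
	if (write(fd, "hello", 5) != 5) {
		close(fd);
		return -1;
	}

	/* Switch the descriptor to append mode and rewind to the start. */
	int fl = fcntl(fd, F_GETFL);
	if (fl == -1 || fcntl(fd, F_SETFL, fl | O_APPEND) == -1 ||
	    lseek(fd, 0, SEEK_SET) != 0) {
		close(fd);
		return -1;
	}

	/* The zero-length write should not move the offset to end-of-file. */
	ssize_t n = write(fd, "", 0);
	off_t after = lseek(fd, 0, SEEK_CUR);
	close(fd);

	if (n != 0)
		return -1;
	return after == 0;
}
```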


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
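
The cumulative nature of the four strict levels listed above can be
sketched as simple threshold checks. The enum and function names here are
hypothetical and are not the identifiers used by Veriexec; only the level
numbers and the policies attached to them come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels (0-3). */
enum ex_strict_level {
	EX_STRICT_LEARNING = 0,	/* warnings only */
	EX_STRICT_IDS      = 1,	/* deny access on fingerprint mismatch */
	EX_STRICT_IPS      = 2,	/* also protect monitored files */
	EX_STRICT_LOCKDOWN = 3	/* also deny new and non-monitored files */
};

/* Each level implies everything below it, so a ">=" test is enough. */
static bool
ex_deny_on_mismatch(enum ex_strict_level l)
{
	return l >= EX_STRICT_IDS;
}

static bool
ex_deny_write_to_monitored(enum ex_strict_level l)
{
	return l >= EX_STRICT_IPS;
}

static bool
ex_deny_file_creation(enum ex_strict_level l)
{
	return l >= EX_STRICT_LOCKDOWN;
}
```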


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
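
A minimal sketch of the convention this commit describes: a lower-layer
handler returns a dedicated "not mine" code instead of -1 or ENOTTY, and
the caller decides what to report. The two numeric values come from the
commit message; the command number, the function names, and the choice to
map the pass-through code to ENOTTY at the top level are assumptions made
for the example.

```c
#include <assert.h>
#include <errno.h>

/* Values from the commit message; names prefixed to avoid real headers. */
#define EX_ERESTART     (-3)
#define EX_EPASSTHROUGH (-4)

#define EX_CMD_GET 1	/* hypothetical command the lower layer handles */

/* Lower layer: handles what it knows, passes everything else through. */
static int
ex_lower_ioctl(int cmd, int *data)
{
	switch (cmd) {
	case EX_CMD_GET:
		*data = 42;
		return 0;
	default:
		return EX_EPASSTHROUGH;	/* unhandled, not an error yet */
	}
}

/* Top level: an unclaimed command becomes ENOTTY (assumed mapping). */
static int
ex_ioctl(int cmd, int *data)
{
	int error = ex_lower_ioctl(cmd, data);

	if (error == EX_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the pass-through code is distinct from the restart code, an
unhandled command can no longer be mistaken for a request to restart the
system call.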


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.
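
The requirement above that struct genfs_node be the first element of the filesystem-specific vnode data can be illustrated with a small userland sketch. All names prefixed with demo_ are hypothetical stand-ins, not the kernel's definitions; the point is only that generic code can treat the opaque per-vnode pointer as the shared structure when that structure sits at offset zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the data shared with the generic code. */
struct demo_genfs_node {
	int g_flags;
};

/* Hypothetical filesystem-specific node: the shared part comes first. */
struct demo_fs_node {
	struct demo_genfs_node n_gnode;	/* must stay the first member */
	long n_size;
};

/* Fail the build if someone moves the shared part away from offset 0. */
_Static_assert(offsetof(struct demo_fs_node, n_gnode) == 0,
    "shared node data must be the first member");

/* Generic code only sees the opaque per-vnode data pointer. */
static struct demo_genfs_node *
demo_to_genfs(void *v_data)
{
	return (struct demo_genfs_node *)v_data;
}

/* Read a shared field without knowing the filesystem-specific layout. */
static int
demo_generic_flags(void *v_data)
{
	return demo_to_genfs(v_data)->g_flags;
}
```

Passing a pointer to a struct demo_fs_node to demo_generic_flags() works because C guarantees that a pointer to a structure, suitably converted, points to its first member.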

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
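
A minimal userland sketch of the balancing rule described above, under stated assumptions: the demo_ names, the structure layout, and the integer rounding are illustrative only and are not taken from the UVM sources. It shows the two checks the message describes, namely rejecting minimums whose sum exceeds 95% and refusing to reclaim a page when that would push its type below its minimum.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_page_type { DEMO_ANON, DEMO_FILE, DEMO_EXEC, DEMO_NTYPES };

struct demo_balance {
	unsigned min_pct[DEMO_NTYPES];		/* minimum share per type, in percent */
	unsigned long inuse[DEMO_NTYPES];	/* pages currently in use per type */
	unsigned long pageable;			/* total pageable pages */
};

/* Accept a new minimum only if all minimums still sum to at most 95%. */
static bool
demo_set_min(struct demo_balance *b, enum demo_page_type t, unsigned pct)
{
	unsigned sum = pct;

	assert(t < DEMO_NTYPES);
	for (int i = 0; i < DEMO_NTYPES; i++)
		if (i != (int)t)
			sum += b->min_pct[i];
	if (sum > 95)
		return false;
	b->min_pct[t] = pct;
	return true;
}

/* A page of type t may be reused only if the count stays at or above the floor. */
static bool
demo_can_reclaim(const struct demo_balance *b, enum demo_page_type t)
{
	assert(t < DEMO_NTYPES);
	unsigned long floor = b->pageable * b->min_pct[t] / 100;
	return b->inuse[t] > floor;
}
```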


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE on
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
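
The calling convention described above can be sketched in userland as follows. This is an illustration under assumptions: demo_read(), struct demo_file, and the DEMO_UPDATE_OFFSET flag are hypothetical names, and the "file" is just an in-memory buffer. The read routine takes the offset by reference and only advances it when the flag is set, so the same routine can serve both a read(2)-style caller (using the file's own offset) and a positioned read (using a private offset).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_UPDATE_OFFSET 0x01	/* hypothetical flag: advance *offset */

struct demo_file {
	const char *data;	/* backing bytes */
	size_t len;		/* number of backing bytes */
	off_t f_offset;		/* the file's own current offset */
};

/* Copy up to n bytes starting at *offset; return the number of bytes copied. */
static size_t
demo_read(const struct demo_file *fp, off_t *offset, void *buf, size_t n,
    int flags)
{
	assert(*offset >= 0);
	size_t start = (size_t)*offset;

	if (start >= fp->len)
		return 0;
	if (n > fp->len - start)
		n = fp->len - start;
	memcpy(buf, fp->data + start, n);
	if (flags & DEMO_UPDATE_OFFSET)
		*offset += (off_t)n;	/* only when the caller asked for it */
	return n;
}
```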


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)
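
From the caller's side, the type matters because the ioctl stores through the pointer it is given. A minimal userland sketch follows; demo_bytes_ready() is a hypothetical helper, and the example assumes a system where <sys/ioctl.h> makes FIONREAD available.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the number of bytes ready to read on fd, or -1 on error. */
static int
demo_bytes_ready(int fd)
{
	int n = 0;	/* FIONREAD stores an int, so pass the address of an int */

	assert(fd >= 0);
	if (ioctl(fd, FIONREAD, &n) == -1)
		return -1;
	return n;
}
```

Passing the address of a wider type such as off_t here would leave part of the variable unwritten, which is the kind of mismatch the change above avoids on the kernel side.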


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


Revision tags: isaki-audio2-base
# 1.200 07-Mar-2019 hannken

Change vn_openchk() to fail VNON and VBAD with error ENXIO.

Reported-by: syzbot+d66b1be08516a4d2d2b2@syzkaller.appspotmail.com
Reported-by: syzbot+c5eaef5a8af535c3b217@syzkaller.appspotmail.com


# 1.199 04-Feb-2019 mrg

s/fall into .../FALLTHROUGH/


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.198 03-Sep-2018 riastradh

Rename min/max -> uimin/uimax for better honesty.

These functions are defined on unsigned int. The generic name
min/max should not silently truncate to 32 bits on 64-bit systems.
This is purely a name change -- no functional change intended.

HOWEVER! Some subsystems have

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

even though our standard name for that is MIN/MAX. Although these
may invite multiple evaluation bugs, these do _not_ cause integer
truncation.

To avoid `fixing' these cases, I first changed the name in libkern,
and then compile-tested every file where min/max occurred in order to
confirm that it failed -- and thus confirm that nothing shadowed
min/max -- before changing it.

I have left a handful of bootloaders that are too annoying to
compile-test, and some dead code:

cobalt ews4800mips hp300 hppa ia64 luna68k vax
acorn32/if_ie.c (not included in any kernels)
macppc/if_gm.c (superseded by gem(4))

It should be easy to fix the fallout once identified -- this way of
doing things fails safe, and the goal here, after all, is to _avoid_
silent integer truncations, not introduce them.

Maybe one day we can reintroduce min/max as type-generic things that
never silently truncate. But we should avoid doing that for a while,
so that existing code has a chance to be detected by the compiler for
conversion to uimin/uimax without changing the semantics until we can
properly audit it all. (Who knows, maybe in some cases integer
truncation is actually intended!)
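
The truncation hazard described above can be reproduced in userland. In this sketch demo_uimin() is a stand-in with the same shape as the renamed function (unsigned int in, unsigned int out), and demo_min_u64() is a hypothetical type-preserving alternative; neither is the libkern code.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in with the shape of the renamed function: unsigned int only. */
static unsigned int
demo_uimin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical alternative that keeps the full 64-bit width. */
static uint64_t
demo_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}
```

With a 64-bit value such as 2^32 + 5 and a cap of 4096, conversion to a 32-bit unsigned int reduces the large value to 5 before the comparison, so the unsigned int version returns 5 while the 64-bit version returns 4096. The explicit "ui" prefix makes that conversion visible at the call site instead of hiding it behind a generic name.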


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

branches: 1.197.2;
add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

branches: 1.195.6;
Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
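
The ordering named above is the usual lock-coupling pattern. A generic userland sketch with POSIX mutexes follows; struct demo_node and demo_step_down() are hypothetical and do not reflect the vnode code itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_node {
	pthread_mutex_t lock;
	struct demo_node *child;
	int value;
};

/*
 * Step from a locked parent to its child.
 * On return the child is locked and the parent is unlocked.
 */
static struct demo_node *
demo_step_down(struct demo_node *parent)
{
	struct demo_node *child = parent->child;

	assert(child != NULL);
	pthread_mutex_lock(&child->lock);	/* take the child while the parent is held */
	pthread_mutex_unlock(&parent->lock);	/* only then let go of the parent */
	return child;
}
```

Because the parent stays locked until the child lock is held, no other thread can change the parent-to-child link in the window between the two operations.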


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
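
The idiom described above can be sketched in userland. struct demo_stat and demo_fill() are hypothetical; the structure is laid out so that most ABIs insert padding after the 32-bit members. Clearing the whole object first is the usual way to keep stale bytes out of that padding before the structure is copied elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; padding is likely after ds_mode and ds_nlink. */
struct demo_stat {
	uint32_t ds_mode;
	uint64_t ds_dev;
	uint32_t ds_nlink;
};

/* Fill in a record, clearing every byte of it first. */
static void
demo_fill(struct demo_stat *sb, uint32_t mode, uint64_t dev, uint32_t nlink)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));	/* clears padding as well as members */
	sb->ds_mode = mode;
	sb->ds_dev = dev;
	sb->ds_nlink = nlink;
}
```

Assigning the members one by one without the memset() would leave whatever was previously in the padding bytes, which is how old stack or heap contents can escape when the whole structure is copied out.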


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
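
The general shape of a size-limit check performed before the write is handed to the file system can be sketched as below. This is a simplified illustration only: demo_check_fsize() is hypothetical and deliberately does not reproduce the kernel's exact semantics mentioned above (v_size vs. va_bytes, overwrite).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if writing len bytes at offset stays within limit,
 * otherwise EFBIG. Written so that offset + len cannot overflow.
 */
static int
demo_check_fsize(uint64_t offset, uint64_t len, uint64_t limit)
{
	if (offset > limit || len > limit - offset)
		return EFBIG;
	return 0;
}
```

Doing the check once in common code, ahead of the per-file-system write routine, means each file system no longer has to carry its own copy of it.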


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
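
A minimal user-space sketch of how the four strict levels listed above might
compose into a single allow/deny decision. The enum names, the file_op set and
policy_denies() are hypothetical and exist only for illustration; they are not
the Veriexec kernel interface, and the access-type check ('direct',
'indirect', 'file') and table modification are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels described above. */
enum strict_level {
	LEVEL_LEARNING = 0,	/* warnings only */
	LEVEL_IDS = 1,		/* deny on fingerprint mismatch */
	LEVEL_IPS = 2,		/* also protect monitored files */
	LEVEL_LOCKDOWN = 3	/* also deny new and non-monitored files */
};

/* Hypothetical set of operations the policy is asked about. */
enum file_op { OP_READ, OP_WRITE, OP_REMOVE, OP_CREATE };

/*
 * Return true when the operation should be denied.
 * `monitored` says whether the file has a fingerprint entry;
 * `fp_matches` says whether the evaluated fingerprint matched.
 */
static bool
policy_denies(enum strict_level level, enum file_op op, bool monitored,
    bool fp_matches)
{
	assert(level <= LEVEL_LOCKDOWN);

	/* Level 0: learning mode never denies, it only warns. */
	if (level == LEVEL_LEARNING)
		return false;

	/* Level 1 and up: a fingerprint mismatch denies access. */
	if (monitored && !fp_matches)
		return true;

	/* Level 2 and up: monitored files may not be written or removed. */
	if (level >= LEVEL_IPS && monitored &&
	    (op == OP_WRITE || op == OP_REMOVE))
		return true;

	/* Level 3: no new files, no access to non-monitored files. */
	if (level == LEVEL_LOCKDOWN && (op == OP_CREATE || !monitored))
		return true;

	return false;
}
```

Each level only adds checks on top of the previous one, which mirrors the
"all implications of strict level N" wording in the list.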


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronized I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
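
A minimal user-space sketch of the calling convention described above, using
an in-memory "file" in place of a real struct file. The flag name
SK_UPDATE_OFFSET, the struct and memfile_read() are hypothetical; the message
does not name the flag it defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define SK_UPDATE_OFFSET 0x01	/* hypothetical name for the new flag */

/* Stand-in for an open file backed by a memory buffer. */
struct memfile {
	const char *data;
	size_t size;
	off_t offset;		/* the file's own position */
};

/*
 * Read up to `len` bytes starting at *offset.  The offset is passed
 * by reference and is only advanced when SK_UPDATE_OFFSET is set.
 */
static ssize_t
memfile_read(struct memfile *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	size_t avail, n;

	assert(fp != NULL && offset != NULL && buf != NULL);

	if (*offset < 0 || (size_t)*offset >= fp->size)
		return 0;	/* at or past end of file */

	avail = fp->size - (size_t)*offset;
	n = len < avail ? len : avail;
	memcpy(buf, fp->data + *offset, n);

	if (flags & SK_UPDATE_OFFSET)
		*offset += (off_t)n;
	return (ssize_t)n;
}
```

A read(2)-style caller would pass &fp->offset with the flag set, while a
positioned read would pass the address of a local offset with flags of 0 and
leave the file's own position untouched.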


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


Revision tags: tls-maxphys-base-20171202
# 1.197 30-Nov-2017 christos

add fo_name so we can identify the fileops in a simple way.


# 1.196 09-Nov-2017 christos

Add O_REGULAR to enforce opening of only regular files
(like we have O_DIRECTORY for directories).
This is better than open(, O_NONBLOCK), fstat()+S_ISREG() because opening
devices can have side effects.
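
A minimal sketch contrasting the two approaches mentioned above. Where
O_REGULAR is defined it is used directly; otherwise the code falls back to the
open(O_NONBLOCK) + fstat() + S_ISREG() idiom. The helper name
open_regular_only() and the EINVAL reported by the fallback are choices made
for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `path` read-only, succeeding only for regular files.
 * Returns a descriptor, or -1 with errno set.
 */
static int
open_regular_only(const char *path)
{
	assert(path != NULL);
#ifdef O_REGULAR
	/* The kernel rejects non-regular files as part of the open. */
	return open(path, O_RDONLY | O_REGULAR);
#else
	/*
	 * Older idiom.  O_NONBLOCK keeps the open from blocking (for
	 * example on a FIFO), but a device has already been opened by
	 * the time fstat() gets a chance to reject it.
	 */
	struct stat st;
	int fd, saved;

	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISREG(st.st_mode)) {
		close(fd);
		errno = EINVAL;	/* sketch's choice of error */
		return -1;
	}
	return fd;
#endif
}
```

The fallback path shows the weakness the message points out: the type check
happens only after the open has already taken effect.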


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.
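The ordering rule above can be illustrated with a toy model. This is not kernel code: struct sk_node and the sk_* helpers are invented here, and a boolean stands in for the vnode lock.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_node {
	bool locked;	/* stands in for the vnode lock */
};

static void
sk_lock(struct sk_node *n)
{
	assert(!n->locked);
	n->locked = true;
}

static void
sk_unlock(struct sk_node *n)
{
	assert(n->locked);
	n->locked = false;
}

/*
 * Step from a locked parent to its child. The child is locked while the
 * parent is still held, and only then is the parent released.
 */
struct sk_node *
sk_descend(struct sk_node *parent, struct sk_node *child)
{
	assert(parent->locked);
	sk_lock(child);
	sk_unlock(parent);
	return child;
}
```

Because the parent stays locked until the child lock is held, there is never a moment in which neither node is locked.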


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33
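The two lock-operation rules listed in revision 1.189 can be sketched as a small model. Everything here is hypothetical (struct sk_vnode, SK_RETRY, both functions); it only mirrors the behaviour the message describes and leaves out vn_lock() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SK_RETRY 0x1	/* stands in for LK_RETRY */

struct sk_vnode {
	bool locked;
	bool dead;	/* set once the vnode has been cleaned */
};

/* Lock operation of a live file system: lock first, then check for death. */
int
sk_fs_lock(struct sk_vnode *vp, int flags)
{
	(void)flags;
	assert(!vp->locked);
	vp->locked = true;
	if (vp->dead) {
		/* Dead vnode: undo the lock and report ENOENT. */
		vp->locked = false;
		return ENOENT;
	}
	return 0;
}

/* Lock operation of deadfs: succeeds only when the caller passed the hint. */
int
sk_deadfs_lock(struct sk_vnode *vp, int flags)
{
	if ((flags & SK_RETRY) == 0)
		return ENOENT;
	assert(!vp->locked);
	vp->locked = true;
	return 0;
}
```

In this model a caller that did not ask for SK_RETRY can never end up holding the lock of a dead vnode, whichever of the two operations is reached.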


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
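A sketch of the technique in revision 1.179, using a made-up two-field structure rather than the real struct stat. The layout comment assumes a typical ABI where a 64-bit member is 8-byte aligned; the actual padding is ABI dependent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure: not the real struct stat. */
struct sk_stat {
	uint32_t st_mode;	/* typically followed by 4 bytes of padding */
	uint64_t st_dev;	/* 64-bit device number, 8-byte aligned */
};

/*
 * Zero the whole structure first, so that padding bytes never carry
 * stale memory contents, then fill in the real fields.
 */
void
sk_fill_stat(struct sk_stat *sb, uint32_t mode, uint64_t dev)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->st_mode = mode;
	sb->st_dev = dev;
}
```

Assigning the fields alone would leave the padding untouched, and a later copy of the whole structure to userspace would expose whatever was there before.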


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.
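For illustration, a minimal userland use of the flag. open_dir() is a hypothetical wrapper, and the feature-test macro is only there to request the POSIX-2008 definitions on systems that need it.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path only if it is a directory; returns -1 otherwise. */
int
open_dir(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDONLY | O_DIRECTORY);
}
```

The check happens inside open() itself, so there is no window between testing the file type and opening it.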


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).
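A simplified model of a size-limit check performed before the write is handed to the file system. sk_check_fsize() is hypothetical, and it deliberately ignores the subtleties the message mentions (v_size vs. va_bytes, overwrite) as well as any signal delivery.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if a write of len bytes at offset stays within limit,
 * EFBIG otherwise. All quantities are in bytes.
 */
int
sk_check_fsize(uint64_t offset, uint64_t len, uint64_t limit)
{
	/* Guard the addition below against wrap-around. */
	if (len > UINT64_MAX - offset)
		return EFBIG;
	if (offset + len > limit)
		return EFBIG;
	return 0;
}
```

Doing the check once, in front of the write operation, means an individual file system cannot forget it.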


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
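The rule in revision 1.160 can be stated as a small pure function. This is a model with invented names, not the kernel's write path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the file offset after a write of len bytes.
 * cur is the current offset, file_size the current end of file.
 */
int64_t
sk_offset_after_write(int64_t cur, int64_t file_size, int64_t len, bool append)
{
	assert(len >= 0);
	if (len == 0)
		return cur;	/* nothing written: offset is left alone */
	if (append)
		cur = file_size;	/* append mode writes at end of file */
	return cur + len;
}
```

Without the early return, a zero-length write in append mode would move the offset to the end of the file even though no data was written.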


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.
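A userland analogy for the first item in revision 1.146: derive one combined permission mask from the open mode and check it once, rather than checking read and write separately. Both function names are hypothetical, and access(2) merely stands in for the kernel's VOP_ACCESS.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Map the access mode of open(2) flags to a single access(2) mask. */
int
sk_access_mask(int oflags)
{
	switch (oflags & O_ACCMODE) {
	case O_RDONLY:
		return R_OK;
	case O_WRONLY:
		return W_OK;
	case O_RDWR:
		return R_OK | W_OK;	/* both bits in one mask */
	default:
		return 0;
	}
}

/* One permission check per open, whatever the access mode. */
int
sk_check_open_access(const char *path, int oflags)
{
	assert(path != NULL);
	return access(path, sk_access_mask(oflags));
}
```

On a network file system each check can cost a round trip, which is why folding two checks into one matters there.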


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
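The two-step search described in revision 1.138 (emulation root first, then the normal root) can be modelled with plain strings. This sketch is an assumption-laden toy: sk_resolve(), sk_exists_fn and sk_fake_exists() are invented, paths are concatenated textually, and the ".." and symlink rules from the message are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a real lookup: reports whether a path exists. */
typedef bool (*sk_exists_fn)(const char *path);

/* Write the path that was found into out; return 0 on success, -1 if none. */
int
sk_resolve(const char *emulroot, const char *path, bool tryemulroot,
    sk_exists_fn exists, char *out, size_t outlen)
{
	int n;

	assert(path != NULL && exists != NULL && out != NULL);

	/* First attempt: absolute paths are looked up under the emulation root. */
	if (tryemulroot && emulroot != NULL && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emulroot, path);
		if (n > 0 && (size_t)n < outlen && exists(out))
			return 0;
	}

	/* Fallback: retry from the normal root. */
	n = snprintf(out, outlen, "%s", path);
	if (n < 0 || (size_t)n >= outlen || !exists(out))
		return -1;
	return 0;
}

/* Toy file system used only to exercise the sketch. */
bool
sk_fake_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}
```

With the toy file system, "/lib/libc.so" resolves inside the emulation root while "/etc/passwd" falls through to the real root, and callers need no scratch buffer of their own beyond the output.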


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.
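The clamp described in revision 1.112 amounts to taking a minimum. In this sketch SK_MAXBSIZE is an assumed stand-in value, not read from the kernel headers, and sk_readdir_bufsize() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAXBSIZE ((size_t)64 * 1024)	/* assumed value for this sketch */

/* Bound a caller-supplied size before allocating a kernel-side buffer. */
size_t
sk_readdir_bufsize(size_t requested)
{
	return requested < SK_MAXBSIZE ? requested : SK_MAXBSIZE;
}
```

A caller asking for a very large directory read then gets its data in several bounded chunks instead of forcing one large allocation.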


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
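Since each strict level in revision 1.91 includes all implications of the level below it, the policy can be modelled as a threshold per event. The enum names and sk_denied() are invented for this sketch, and the access-type enforcement ('direct', 'indirect', 'file') is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four strict levels (0-3). */
enum sk_strict {
	SK_LEARNING = 0,
	SK_IDS = 1,
	SK_IPS = 2,
	SK_LOCKDOWN = 3
};

/* Hypothetical names for the events the levels react to. */
enum sk_event {
	SK_FP_MISMATCH,		/* fingerprint does not match */
	SK_TABLE_MODIFY,	/* modification of veriexec tables */
	SK_WRITE_MONITORED,	/* write access to a monitored file */
	SK_REMOVE_MONITORED,	/* removal of a monitored file */
	SK_CREATE_FILE,		/* creation of a new file */
	SK_ACCESS_UNMONITORED	/* access to a non-monitored file */
};

/* Return true if the event is denied at the given level. */
bool
sk_denied(enum sk_strict level, enum sk_event ev)
{
	switch (ev) {
	case SK_FP_MISMATCH:
	case SK_TABLE_MODIFY:
		return level >= SK_IDS;
	case SK_WRITE_MONITORED:
	case SK_REMOVE_MONITORED:
		return level >= SK_IPS;
	case SK_CREATE_FILE:
	case SK_ACCESS_UNMONITORED:
		return level >= SK_LOCKDOWN;
	}
	return false;
}
```

At level 0 nothing is denied, which matches the learning mode described above where such events only produce warnings.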


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint - the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
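
A minimal userland sketch of the FIONWRITE/FIONSPACE relationship described above. The struct and function names (model_sendq, model_fionwrite, model_fionspace) are illustrative assumptions, not the kernel's socket buffer code, and the real accounting may also consider other limits.

```c
#include <assert.h>
#include <stddef.h>

/* Toy send queue: sb_hiwat is the queue capacity, sb_cc the bytes queued. */
struct model_sendq {
	size_t sb_hiwat;
	size_t sb_cc;
};

/* What FIONWRITE reports in this model: bytes currently in the send queue. */
size_t
model_fionwrite(const struct model_sendq *sq)
{
	assert(sq != NULL);
	return sq->sb_cc;
}

/* What FIONSPACE reports in this model: free bytes left in the send queue. */
size_t
model_fionspace(const struct model_sendq *sq)
{
	assert(sq != NULL);
	if (sq->sb_cc >= sq->sb_hiwat)
		return 0;	/* queue full or over-committed */
	return sq->sb_hiwat - sq->sb_cc;
}
```

In this model a caller can learn the remaining space from model_fionspace() alone, without knowing sb_hiwat, which is the convenience the commit message gives as the reason for FIONSPACE.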


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
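
A rough sketch of the pass-through convention this commit describes, written as a userland model. The numeric values of ERESTART and EPASSTHROUGH come from the message above; the command numbers, the layer functions and the place where the final conversion to ENOTTY happens are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values stated in the commit message; redefined here for the model only. */
#define MODEL_ERESTART		(-3)
#define MODEL_EPASSTHROUGH	(-4)

/* Hypothetical command numbers. */
enum { CMD_GET_SIZE = 1, CMD_RESET = 2, CMD_UNKNOWN = 99 };

/* Lower layer: handles what it knows, passes everything else through. */
int
lower_ioctl(int cmd, int *data)
{
	assert(data != NULL);
	switch (cmd) {
	case CMD_GET_SIZE:
		*data = 512;
		return 0;
	default:
		return MODEL_EPASSTHROUGH;	/* not -1, not ENOTTY */
	}
}

/* Upper layer: tries its own commands first, then defers downward. */
int
upper_ioctl(int cmd, int *data)
{
	assert(data != NULL);
	switch (cmd) {
	case CMD_RESET:
		*data = 0;
		return 0;
	default:
		return lower_ioctl(cmd, data);
	}
}

/* Outermost caller: only here does "nobody handled it" become ENOTTY. */
int
model_sys_ioctl(int cmd, int *data)
{
	int error = upper_ioctl(cmd, data);

	if (error == MODEL_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Because the "unhandled" code is distinct from ERESTART, the outermost caller in this model can tell "restart the call" apart from "no layer recognised the command".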


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
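
The two rules in this message (minimums may not sum to more than 95%, and a page is not reused if that would drop its type below the minimum) can be sketched as follows. This is a simplified model: the type names, struct layout and exact comparison are assumptions, not the UVM implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* The three usage types named above, one minimum percentage each. */
enum { UBC_ANON, UBC_VNODE, UBC_VTEXT, UBC_NTYPES };

struct model_mins {
	int pct[UBC_NTYPES];	/* percent of pageable memory reserved */
};

#define MODEL_MINS_LIMIT 95	/* the sum of the minimums may not exceed this */

/* Set one minimum; reject the change if the sum would exceed the limit. */
int
model_set_min(struct model_mins *m, int type, int newpct)
{
	int i, sum = 0;

	assert(m != NULL);
	if (type < 0 || type >= UBC_NTYPES || newpct < 0 || newpct > 100)
		return EINVAL;
	for (i = 0; i < UBC_NTYPES; i++)
		sum += (i == type) ? newpct : m->pct[i];
	if (sum > MODEL_MINS_LIMIT)
		return EINVAL;
	m->pct[type] = newpct;
	return 0;
}

/*
 * May a page of the given type be reused?  Only if the type would still
 * hold at least its minimum share of pageable memory afterwards.
 */
int
model_may_reuse(const struct model_mins *m, int type,
    unsigned long npages, unsigned long pageable)
{
	assert(m != NULL && type >= 0 && type < UBC_NTYPES);
	if (npages == 0)
		return 0;
	return (npages - 1) * 100 >= pageable * (unsigned long)m->pct[type];
}
```

With a 5% minimum and 1000 pageable pages, the model allows reuse while more than 50 pages of that type remain and refuses once the count would fall below 50.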


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
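
A small userland model of the calling convention described above: the read routine takes a pointer to the offset to use plus a flags word, and only advances the offset when asked to. The names model_file, model_fo_read and MODEL_FOF_UPDATE_OFFSET are stand-ins chosen for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_FOF_UPDATE_OFFSET	0x01	/* advance *offset after the I/O */

struct model_file {
	const char *data;	/* backing bytes */
	size_t size;		/* number of backing bytes */
	off_t f_offset;		/* the file's own offset */
};

/* Read up to len bytes starting at *offset; update *offset only on request. */
ssize_t
model_fo_read(struct model_file *fp, off_t *offset, void *buf, size_t len,
    int flags)
{
	size_t avail, n;

	assert(fp != NULL && offset != NULL && buf != NULL);
	if (*offset < 0)
		return -1;
	if ((size_t)*offset >= fp->size)
		return 0;			/* at or past end of file */
	avail = fp->size - (size_t)*offset;
	n = len < avail ? len : avail;
	memcpy(buf, fp->data + *offset, n);
	if (flags & MODEL_FOF_UPDATE_OFFSET)
		*offset += (off_t)n;
	return (ssize_t)n;
}
```

A read(2)-style caller would pass &fp->f_offset together with the update flag, while a positioned read would pass a private offset and no flag, leaving f_offset untouched.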


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


Revision tags: prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.195 30-Mar-2017 hannken

Lock the vnode before changing its writecount.


Revision tags: pgoyette-localcount-20170320
# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

branches: 1.193.2; 1.193.4;
Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
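
The leak described here comes from padding bytes between structure members, which keep whatever was in memory before unless the whole structure is cleared first. The sketch below illustrates the idiom in userland; model_stat and its fields are a made-up layout for illustration, and the padding check relies on common compiler behaviour (member stores leave padding alone) rather than on a guarantee of the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 64-bit member after a 32-bit one leaves padding. */
struct model_stat {
	uint32_t st_mode;
	uint64_t st_dev;
	uint32_t st_nlink;
};

/* Fill in the structure, zeroing all of it first so padding holds no junk. */
void
model_vn_stat(struct model_stat *sb, uint32_t mode, uint64_t dev,
    uint32_t nlink)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->st_mode = mode;
	sb->st_dev = dev;
	sb->st_nlink = nlink;
}

/* True if byte index i lies inside member m of struct model_stat. */
#define IN_MEMBER(i, m)							\
	((i) >= offsetof(struct model_stat, m) &&			\
	 (i) < offsetof(struct model_stat, m) +				\
	     sizeof(((struct model_stat *)0)->m))

/* Count nonzero bytes that belong to no member, i.e. dirty padding. */
size_t
model_stat_dirty_padding(const struct model_stat *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	size_t i, dirty = 0;

	assert(sb != NULL);
	for (i = 0; i < sizeof(*sb); i++) {
		if (IN_MEMBER(i, st_mode) || IN_MEMBER(i, st_dev) ||
		    IN_MEMBER(i, st_nlink))
			continue;
		if (p[i] != 0)
			dirty++;
	}
	return dirty;
}
```

If the memset() were dropped, any padding bytes would still carry the previous contents of the buffer, which in the kernel case is what would be copied out to userspace.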


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset
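
One way to picture the behaviour this fix asks for, as a toy model: an append-mode write normally moves the offset to end of file before writing, but a zero-length write should leave the offset where it was. The structure and function names are invented for the sketch, and how the actual change achieves this is not stated in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

struct model_afile {
	off_t f_offset;	/* per-open file offset */
	off_t v_size;	/* current size of the underlying file */
	int append;	/* nonzero if opened with O_APPEND */
};

/* Model of a write of count bytes; the data itself is not represented. */
ssize_t
model_vn_write(struct model_afile *fp, size_t count)
{
	assert(fp != NULL);
	if (count == 0)
		return 0;		/* nothing written: offset untouched */
	if (fp->append)
		fp->f_offset = fp->v_size;	/* appends start at EOF */
	fp->f_offset += (off_t)count;
	if (fp->f_offset > fp->v_size)
		fp->v_size = fp->f_offset;
	return (ssize_t)count;
}
```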


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.
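
The idea of folding both access checks into one can be sketched roughly as below. This is a self-contained model, not the committed code: the X_ constants, fake_access(), and open_access_check() are hypothetical stand-ins for the kernel's FREAD/FWRITE, VREAD/VWRITE, and VOP_ACCESS().

```c
/* Illustrative stand-ins for the kernel's open flags and access bits. */
#define X_FREAD		0x1
#define X_FWRITE	0x2
#define X_VREAD		0x4
#define X_VWRITE	0x2

int access_calls;	/* counts simulated VOP_ACCESS() calls */

/* Hypothetical stand-in for VOP_ACCESS(): grants whatever is asked. */
int
fake_access(int mode)
{
	(void)mode;
	access_calls++;
	return 0;
}

/*
 * Build one access mask from the open mode so that an O_RDWR open costs a
 * single check (and, on nfs, a single rpc) instead of one per direction.
 */
int
open_access_check(int fmode)
{
	int mode = 0;

	if (fmode & X_FREAD)
		mode |= X_VREAD;
	if (fmode & X_FWRITE)
		mode |= X_VWRITE;
	return mode != 0 ? fake_access(mode) : 0;
}
```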


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
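
The lookup order described above (emulation root first, then the normal root) can be modelled in userland roughly as follows. This is only a sketch under stated assumptions: emul_resolve(), exists_fn, and demo_exists() are hypothetical names, and the real logic lives inside namei() and works on vnodes rather than strings.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate type: returns nonzero if `path` exists. */
typedef int (*exists_fn)(const char *path);

/* Demo predicate: pretends exactly two files exist. */
int
demo_exists(const char *path)
{
	return strcmp(path, "/emul/linux/lib/libc.so") == 0 ||
	    strcmp(path, "/etc/passwd") == 0;
}

/*
 * For an absolute path, try it under the emulation root first; if that
 * fails, retry from the normal root.  Writes the chosen path to `out` and
 * returns 0, or returns -1 if nothing was found or `out` is too small.
 */
int
emul_resolve(const char *emul_root, const char *path, exists_fn exists,
    char *out, size_t outlen)
{
	if (path[0] == '/' && emul_root != NULL) {
		int n = snprintf(out, outlen, "%s%s", emul_root, path);

		if (n >= 0 && (size_t)n < outlen && exists(out))
			return 0;
	}
	if (strlen(path) >= outlen || !exists(path))
		return -1;
	strcpy(out, path);
	return 0;
}
```

In this model, "/lib/libc.so" resolves inside the emulation root, while "/etc/passwd" falls back to the real root because no copy exists under the emulation root.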


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
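
Because each strict level implies all restrictions of the lower ones, the policy above reduces to threshold comparisons. The sketch below illustrates that structure only; the enum and function names are hypothetical and are not the identifiers used by the veriexec code.

```c
#include <stdbool.h>

/* Hypothetical names for the four strict levels listed above. */
enum strict_level {
	STRICT_LEARNING = 0,	/* warnings only */
	STRICT_IDS = 1,
	STRICT_IPS = 2,
	STRICT_LOCKDOWN = 3
};

/* Level 1 and up: a fingerprint mismatch denies access. */
bool
deny_on_mismatch(enum strict_level l)
{
	return l >= STRICT_IDS;
}

/* Level 2 and up: monitored files may not be written or removed. */
bool
deny_write_monitored(enum strict_level l)
{
	return l >= STRICT_IPS;
}

/* Level 3 only: no new files and no access to non-monitored files. */
bool
deny_unmonitored(enum strict_level l)
{
	return l >= STRICT_LOCKDOWN;
}
```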


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide an ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
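
The convention can be illustrated with a small self-contained model. The numeric values are the ones stated in this entry; the X_ prefix, the command number, and both function names are hypothetical, chosen so the sketch does not clash with real headers. Mapping an unclaimed command to ENOTTY at the top level is an assumption of this sketch.

```c
#include <errno.h>

/* Values as stated in the log entry above. */
#define X_ERESTART	(-3)
#define X_EPASSTHROUGH	(-4)

#define X_CMD_GETFOO	1	/* hypothetical command number */

/* Hypothetical low-level handler: claims one command, passes the rest on. */
int
layer_ioctl(unsigned long cmd, int *data)
{
	switch (cmd) {
	case X_CMD_GETFOO:
		*data = 42;
		return 0;
	default:
		/* Not ours: say so explicitly rather than returning -1. */
		return X_EPASSTHROUGH;
	}
}

/* Top-level dispatcher: an unclaimed command becomes ENOTTY for the caller. */
int
top_ioctl(unsigned long cmd, int *data)
{
	int error = layer_ioctl(cmd, data);

	if (error == X_EPASSTHROUGH)
		error = ENOTTY;
	return error;
}
```

Keeping the "unhandled" code distinct from the restart code is what lets the caller tell the two cases apart.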


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
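
The 95% cap on the sum of the minimums can be sketched as below. This is
only an illustration of the rule stated above, not the actual sysctl
handler: the struct, the enum and the function name ubc_set_min() are
hypothetical, and only the three tunable names come from the log message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical container for vm.anonmin, vm.vnodemin and vm.vtextmin,
 * each expressed as a percentage of pageable memory. */
struct ubc_mins {
	int anonmin;
	int vnodemin;
	int vtextmin;
};

enum ubc_kind { UBC_ANON, UBC_VNODE, UBC_VTEXT };

/*
 * Illustrative validation only: accept a new minimum if it is a valid
 * percentage and the sum of all three minimums stays at or below 95.
 * Returns 0 on success, EINVAL otherwise (*mins is left unchanged).
 */
int
ubc_set_min(struct ubc_mins *mins, enum ubc_kind kind, int pct)
{
	struct ubc_mins t;

	assert(mins != NULL);
	t = *mins;
	if (pct < 0 || pct > 100)
		return EINVAL;
	switch (kind) {
	case UBC_ANON:	t.anonmin = pct; break;
	case UBC_VNODE:	t.vnodemin = pct; break;
	case UBC_VTEXT:	t.vtextmin = pct; break;
	default:	return EINVAL;
	}
	/* Always leave at least 5% of memory that can be reused. */
	if (t.anonmin + t.vnodemin + t.vtextmin > 95)
		return EINVAL;
	*mins = t;
	return 0;
}
```

Working on a copy and committing it only after the sum check keeps the
three values consistent when a request is rejected.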


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredump from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is all the
header files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE
with locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.
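
The two rules above could be sketched as a pair of flag translations.
This is a toy model, not the code in vn_read()/vn_write(): every TOY_*
name and value is hypothetical, and it assumes that "file integrity
completion of read operations" means O_RSYNC requested together with
O_SYNC, which is how POSIX describes that combination.

```c
#include <assert.h>

/* Hypothetical per-file flags; a real kernel has its own names and values. */
#define TOY_FSYNC 0x1	/* file integrity completion requested (O_SYNC) */
#define TOY_DSYNC 0x2	/* data integrity completion requested (O_DSYNC) */
#define TOY_RSYNC 0x4	/* extend the requested integrity to reads (O_RSYNC) */

/* Hypothetical I/O flags handed to the vnode read/write operators. */
#define TOY_IO_SYNC  0x10
#define TOY_IO_DSYNC 0x20

/* Read side: file integrity completion of reads maps to IO_SYNC. */
int
toy_read_ioflag(int fflag)
{
	int ioflag = 0;

	/* Only the toy flags above are expected here. */
	assert((fflag & ~(TOY_FSYNC | TOY_DSYNC | TOY_RSYNC)) == 0);
	if ((fflag & (TOY_FSYNC | TOY_RSYNC)) == (TOY_FSYNC | TOY_RSYNC))
		ioflag |= TOY_IO_SYNC;
	return ioflag;
}

/* Write side: data integrity completion of writes maps to IO_DSYNC. */
int
toy_write_ioflag(int fflag)
{
	int ioflag = 0;

	assert((fflag & ~(TOY_FSYNC | TOY_DSYNC | TOY_RSYNC)) == 0);
	if (fflag & TOY_DSYNC)
		ioflag |= TOY_IO_DSYNC;
	return ioflag;
}
```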


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.
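
A minimal sketch of the calling convention described above, assuming a
toy in-memory file. struct toy_file, toy_read() and TOY_UPDATE_OFFSET
are hypothetical stand-ins, not the real fileops signature or flag name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_UPDATE_OFFSET 0x01	/* hypothetical: write the new offset back */

/* Toy file object: a byte array plus the per-file offset. */
struct toy_file {
	const char *tf_data;
	size_t tf_size;
	off_t tf_offset;
};

/*
 * Read up to len bytes starting at *offset. The offset passed by
 * reference is advanced only when TOY_UPDATE_OFFSET is set, so one
 * routine can serve both read(2)-style and pread(2)-style callers.
 */
ssize_t
toy_read(struct toy_file *fp, off_t *offset, void *buf, size_t len, int flags)
{
	off_t off = *offset;
	size_t avail, n;

	assert(off >= 0);
	avail = (size_t)off < fp->tf_size ? fp->tf_size - (size_t)off : 0;
	n = len < avail ? len : avail;
	if (n > 0)
		memcpy(buf, fp->tf_data + off, n);
	if (flags & TOY_UPDATE_OFFSET)
		*offset = off + (off_t)n;
	return (ssize_t)n;
}
```

A read(2)-style caller would pass &fp->tf_offset with the flag set; a
pread(2)-style caller would pass the address of a local offset and no
flag, leaving the file's own offset untouched.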


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision


# 1.194 01-Mar-2017 hannken

Must always lock the parent -> lock the child -> unlock the parent.


Revision tags: nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004 localcount-20160914 pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.193 04-Feb-2015 msaitoh

Remove useless semicolon reported by Henning Petersen in PR#49634.


# 1.192 14-Dec-2014 chs

add a new "fo_mmap" fileops method to allow use of arbitrary uvm_objects for
mappings of file objects. move vnode-specific details of mmap()ing a vnode
from uvm_mmap() to the new vnode-specific vn_mmap(). add new uvm_mmap_dev()
and uvm_mmap_anon() convenience functions for mapping character devices
and anonymous memory, and replace all other calls to uvm_mmap() with those.
use the new fileop in drm2 so that libdrm can use mmap() to map things
like on other platforms (instead of the ioctl that we have used so far).


Revision tags: nick-nhusb-base
# 1.191 05-Sep-2014 matt

branches: 1.191.2;
Try not to use f_data, use f_{vnode,socket,pipe,mqueue,kqueue,ksem} to get
a correctly typed pointer.


Revision tags: netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.190 22-Jun-2014 maxv

branches: 1.190.2;
Fix a NULL pointer dereference after a loooong discussion with dholland@,
hannken@, blymn@ and martin@.

This bug would panic the system when veriexec is set to the VERIEXEC_LOCKDOWN
mode (only settable from root).


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.189 27-Feb-2014 hannken

branches: 1.189.2;
The current implementation of vn_lock() is racy. Modification of
the vnode operations vector for active vnodes is unsafe because it
is not known whether deadfs or the original file system will be
called.

- Pass down LK_RETRY to the lock operation (hint for deadfs only).

- Change deadfs lock operation to return ENOENT if LK_RETRY is unset.

- Change all other lock operations to check for dead vnode once
the vnode is locked and unlock and return ENOENT in this case.

With these changes in place vnode lock operations will never succeed
after vclean() has marked the vnode as VI_XLOCK and before vclean()
has changed the operations vector.

Addresses PR kern/37706 (Forced unmount of file systems is unsafe)

Discussed on tech-kern.

Welcome to 6.99.33


# 1.188 23-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to return
the resulting vnode *vpp unlocked.

Discussed on tech-kern@

Welcome to 6.99.30


# 1.187 17-Jan-2014 hannken

Change vnode operations create, mknod, mkdir and symlink to keep the
directory node dvp locked on return.

Discussed on tech-kern@

Welcome to 6.99.29


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base agc-symver-base yamt-pagecache-base8 yamt-pagecache-base7
# 1.186 12-Nov-2012 hannken

branches: 1.186.2;
Bring back Manuel Bouyer's patch to resolve races between vget() and vrelel()
resulting in vget() returning dead vnodes.
It is impossible to resolve these races in vn_lock().

Needs pullup to NetBSD-6.


Revision tags: yamt-pagecache-base6
# 1.185 24-Aug-2012 dholland

branches: 1.185.2;
don't truncate size_t to int


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8
# 1.184 05-Apr-2012 hannken

Fix vn_lock() to return an invalid (dead, clean) vnode
only if the caller requested it by setting LK_RETRY.

Should fix PR #46221: Kernel panic in NFS server code


Revision tags: jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-pre-base2 jmcneill-usbmp-base2 netbsd-6-base jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.183 14-Oct-2011 hannken

branches: 1.183.2; 1.183.6; 1.183.8;
Change the vnode locking protocol of VOP_GETATTR() to request at least
a shared lock. Make all calls outside of file systems respect it.

The calls from file systems need review.

No objections from tech-kern.


# 1.182 16-Aug-2011 yamt

vn_close: add an assertion


# 1.181 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base bouyer-quota2-nbase bouyer-quota2-base jruoho-x86intr-base matt-mips64-premerge-20101231 rmind-uvmplock-base
# 1.180 19-Nov-2010 dholland

branches: 1.180.6;
Introduce struct pathbuf. This is an abstraction to hold a pathname
and the metadata required to interpret it. Callers of namei must now
create a pathbuf and pass it to NDINIT (instead of a string and a
uio_seg), then destroy the pathbuf after the namei session is
complete.

Update all namei call sites accordingly. Add a pathbuf(9) man page and
update namei(9).

The pathbuf interface also now appears in a couple of related
additional places that were passing string/uio_seg pairs that were
later fed into NDINIT. Update other call sites accordingly.


Revision tags: uebayasi-xip-base4
# 1.179 28-Oct-2010 pooka

Zero entire stat structure before filling in contents to avoid
leaking kernel memory -- the elements are no longer packed now that
dev_t is 64bit.

from pgoyette
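
The leak described above comes from padding bytes that field-by-field
assignment never writes. A minimal sketch of the fix pattern, assuming a
hypothetical struct toy_stat in place of the real struct stat:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record with padding between fields on most ABIs,
 * standing in for a stat-like structure copied out to userspace. */
struct toy_stat {
	uint32_t ts_mode;	/* typically followed by 4 bytes of padding */
	uint64_t ts_dev;	/* 64-bit field, usually 8-byte aligned */
	uint16_t ts_nlink;	/* typically followed by tail padding */
};

/*
 * Fill *sb so that every byte, including compiler-inserted padding,
 * has a defined value: clear the whole object first, then assign.
 */
void
toy_stat_fill(struct toy_stat *sb, uint32_t mode, uint64_t dev, uint16_t nlink)
{
	assert(sb != NULL);
	memset(sb, 0, sizeof(*sb));
	sb->ts_mode = mode;
	sb->ts_dev = dev;
	sb->ts_nlink = nlink;
}

/* Return 1 if the bytes in [from, to) of *sb are all zero. */
int
toy_stat_range_is_zero(const struct toy_stat *sb, size_t from, size_t to)
{
	const unsigned char *p = (const unsigned char *)sb;

	for (size_t i = from; i < to; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Without the memset(), whatever was previously in the buffer would
remain in the padding and be copied out along with the fields.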


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11
# 1.178 21-Sep-2010 chs

implement O_DIRECTORY as standardized in POSIX-2008,
for both native and linux emulations.
this fixes the rest of PR 43695.


# 1.177 25-Aug-2010 pooka

I'm not even going to describe this change. I'll just say that
churn creates interesting code.

Fixes open(O_CREAT|O_TRUNC) on at least tmpfs and nfs to not fail
with ENOENT due to a racy removal of the newly created file.

Caught, as most bugs these days are, by a test run.


Revision tags: uebayasi-xip-base2 yamt-nfs-mp-base10
# 1.176 28-Jul-2010 hannken

Modify vn_lock():
- Take v_interlock before examining v_iflag
- Must always be called without v_interlock taken,
LK_INTERLOCK flag is no longer allowed.


# 1.175 13-Jul-2010 pooka

Don't leak kernel stack into userspace.


# 1.174 24-Jun-2010 hannken

Clean up vnode lock operations pass 2:

VOP_UNLOCK(vp, flags) -> VOP_UNLOCK(vp): Remove the unneeded flags argument.

Welcome to 5.99.32.

Discussed on tech-kern.


# 1.173 18-Jun-2010 hannken

Remove the concept of recursive vnode locks by eliminating
vn_setrecurse(), vn_restorerecurse() and LK_CANRECURSE.
Welcome to 5.99.31

Discussed on tech-kern.


# 1.172 06-Jun-2010 hannken

Change layered file systems to always pass the locking VOP's down to the
leaf file system. Remove now unused member v_vnlock from struct vnode.
Welcome to 5.99.30

Discussed on tech-kern.


Revision tags: uebayasi-xip-base1
# 1.171 23-Apr-2010 pooka

Enforce RLIMIT_FSIZE before VOP_WRITE. This adds support to file
system drivers where it was missing from and fixes one buggy
implementation. The arguably weird semantics of the check are
maintained (v_size vs. va_bytes, overwrite).


# 1.170 29-Mar-2010 pooka

Stop exposing fifofs internals and leave only fifo_vnodeop_p visible.


Revision tags: yamt-nfs-mp-base9 uebayasi-xip-base
# 1.169 08-Jan-2010 pooka

branches: 1.169.2; 1.169.4;
The VATTR_NULL/VREF/VHOLD/HOLDRELE() macros lost their will to live
years ago when the kernel was modified to not alter ABI based on
DIAGNOSTIC, and now just call the respective function interfaces
(in lowercase). Plenty of mix'n match upper/lowercase has crept
into the tree since then. Nuke the macros and convert all callsites
to lowercase.

no functional change


# 1.168 20-Dec-2009 dsl

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567


Revision tags: matt-premerge-20091211
# 1.167 09-Dec-2009 dsl

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
to drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 jym-xensuspend-nbase
# 1.166 17-May-2009 yamt

remove FILE_LOCK and FILE_UNLOCK.


Revision tags: yamt-nfs-mp-base4 yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.165 11-Apr-2009 christos

Fix locking as Andy explained. Also fill in uid and gid like sys_pipe did.


# 1.164 04-Apr-2009 ad

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.

Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.

thr0 accept(fd, ...)
thr1 close(fd)


Revision tags: nick-hppapmap-base2
# 1.163 11-Feb-2009 enami

Make module (auto)loading under chroot environment actually work:
- NOCHROOT flag must be assigned to different bit from TRYEMULROOT
since the code expected to be executed is in the else clause of
if (flags & TRYEMULROOT).
- Necessary variables aren't set.


Revision tags: mjf-devfs2-base
# 1.162 17-Jan-2009 yamt

branches: 1.162.2;
malloc -> kmem_alloc.


Revision tags: haad-dm-base2 haad-nbase2 ad-audiomp2-base haad-dm-base
# 1.161 12-Nov-2008 ad

Remove LKMs and switch to the module framework, pass 1.

Proposed on tech-kern@.


Revision tags: netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3 wrstuden-revivesa-base-2
# 1.160 27-Aug-2008 christos

branches: 1.160.2; 1.160.4;
Writing 0 bytes on an O_APPEND file should not affect the offset


# 1.159 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: wrstuden-revivesa-base-1 simonb-wapbl-nbase yamt-pf42-base4 simonb-wapbl-base yamt-pf42-base3 wrstuden-revivesa-base
# 1.158 02-Jun-2008 ad

branches: 1.158.2; 1.158.4;
Don't needlessly acquire v_interlock.


# 1.157 02-Jun-2008 ad

vn_marktext, vn_lock: don't needlessly acquire v_interlock.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2 yamt-nfs-mp-base2 yamt-nfs-mp-base
# 1.156 24-Apr-2008 ad

branches: 1.156.2; 1.156.4;
Network protocol interrupts can now block on locks, so merge the globals
proclist_mutex and proclist_lock into a single adaptive mutex (proc_lock).
Implications:

- Inspecting process state requires thread context, so signals can no longer
be sent from a hardware interrupt handler. Signal activity must be
deferred to a soft interrupt or kthread.

- As the proc state locking is simplified, it's now safe to take exit()
and wait() out from under kernel_lock.

- The system spends less time at IPL_SCHED, and there is less lock activity.


Revision tags: yamt-pf42-baseX yamt-pf42-base ad-socklock-base1 yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.155 21-Mar-2008 ad

branches: 1.155.2;
Catch up with descriptor handling changes. See kern_descrip.c revision
1.173 for details.


Revision tags: keiichi-mipv6-nbase nick-net80211-sync-base keiichi-mipv6-base matt-armv6-nbase mjf-devfs-base hpcarm-cleanup-base
# 1.154 30-Jan-2008 ad

branches: 1.154.6;
Replace struct lock on vnodes with a simpler lock object built on
krwlock_t. This is a step towards removing lockmgr and simplifying
vnode locking. Discussed on tech-kern.


# 1.153 25-Jan-2008 ad

vn_setrecurse: if no lock is exported, use v_lock. Works around issue
described in PR kern/37808. The ideal solution here is to kill vnode
lock recursion, which should not be hard once it is understood what
the two remaining callers of vn_setrecurse() are doing.


# 1.152 25-Jan-2008 ad

Remove VOP_LEASE. Discussed on tech-kern.


# 1.151 25-Jan-2008 pooka

vn_write: include f_advice in VOP_WRITE


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.150 05-Jan-2008 dsl

Use FILE_LOCK() and FILE_UNLOCK()


# 1.149 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3 yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base jmcneill-pm-base
# 1.148 08-Dec-2007 pooka

branches: 1.148.4;
Remove cn_lwp from struct componentname. curlwp should be used
from now on. The NDINIT() macro no longer takes the lwp parameter and
associates the credentials of the calling thread with the namei
structure.


Revision tags: vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase reinoud-bufcleanup-base
# 1.147 02-Dec-2007 hannken

branches: 1.147.2;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


# 1.146 30-Nov-2007 yamt

- reduce the number of VOP_ACCESS calls for O_RDWR. for nfs, it reduces
the number of rpcs.
- reduce code duplication.


# 1.145 29-Nov-2007 ad

Use atomics to maintain uvmexp.{anon,exec,file}pages.


# 1.144 26-Nov-2007 pooka

Remove the "struct lwp *" argument from all VFS and VOP interfaces.
The general trend is to remove it from all kernel interfaces and
this is a start. In case the calling lwp is desired, curlwp should
be used.

quick consensus on tech-kern


Revision tags: jmcneill-base bouyer-xenamd64-base2 yamt-x86pmap-base4 bouyer-xenamd64-base yamt-x86pmap-base3 vmlocking-base
# 1.143 10-Oct-2007 ad

branches: 1.143.4;
Merge from vmlocking:

- Split vnode::v_flag into three fields, depending on field locking.
- simple_lock -> kmutex in a few places.
- Fix some simple locking problems.


# 1.142 08-Oct-2007 ad

Merge file descriptor locking, cwdi locking and cross-call changes
from the vmlocking branch.


# 1.141 07-Oct-2007 hannken

Update the file system copy-on-write handler.

- Instead of hooking the handler on the specdev of a mounted file system
hook directly on the `struct mount'.

- Rename from `vn_cow_*' to `fscow_*' and move to `kern/vfs_trans.c'. Use
`mount_*specific' instead of clobbering `struct mount' or `struct specinfo'.

- Replace the hand-made reader/writer lock with a krwlock.

- Keep `vn_cow_*' functions and mark as obsolete.

- Welcome to NetBSD 4.99.32 - `struct specinfo' changed size.

Reviewed by: Jason Thorpe <thorpej@netbsd.org>


Revision tags: nick-csl-alignment-base5 yamt-x86pmap-base2 yamt-x86pmap-base matt-mips64-base
# 1.140 22-Jul-2007 pooka

branches: 1.140.4; 1.140.6; 1.140.8; 1.140.10;
Retire uvn_attach() - it abuses VXLOCK and its functionality,
setting vnode sizes, is handled elsewhere: file system vnode creation
or spec_open() for regular files or block special files, respectively.

Add a call to VOP_MMAP() to the pagedvn exec path, since the vnode
is being memory mapped.

reviewed by tech-kern & wrstuden


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.139 19-May-2007 christos

branches: 1.139.2;
- remove pathname_ interface.
- use macros to deal with pathnames in userspace, when veriexec is used.
- reorder the veriexec_ call arguments for consistency.
With help from elad@ finding the last bug.


Revision tags: yamt-idlelwp-base8
# 1.138 22-Apr-2007 dsl

Change the way that emulations locate files within the emulation root to
avoid having to allocate space in the 'stackgap'
- which is very LWP unfriendly.
The additional code for non-emulation namei() is trivial, the reduction for
the emulations is massive.
The vnode for a process's emulation root is saved in the cwdi structure
during process exec.
If the emulation root and the TRYEMULROOT flag are set, namei() will do an initial
search for absolute pathnames in the emulation root, if that fails it will
retry from the normal root.
".." at the emulation root will always go to the real root, even in the middle
of paths and when expanding symlinks.
Absolute symlinks found using absolute paths in the emulation root will be
relative to the emulation root (so /usr/lib/xxx.so -> /lib/xxx.so links
inside the emulation root don't need changing).
If the root of the emulation would be returned (for an emulation lookup), then
the real root is returned instead (matching the behaviour of emul_lookup,
but being a cheap comparison here) so that programs that scan "../.."
looking for the root directory don't loop forever.
The target for symbolic links is no longer mangled (it used to get the
CHECK_ALT_xxx() treatment, so could get /emul/xxx prepended).
CHECK_ALT_xxx() are no more. Most of the change is deleting them, and adding
TRYEMULROOT to the flags to NDINIT().
A lot of the emulation system call stubs could now be deleted.
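
The "try the emulation root first, then retry from the normal root" step
can be sketched in user space as follows. This is only a model: string
prefixing stands in for starting the lookup at the saved emulation root
vnode, the path table replaces a real file system, and toy_lookup(),
toy_exists(), TOY_TRYEMULROOT and the /emul/linux prefix are all
hypothetical. The ".." and symlink rules above are not modelled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_TRYEMULROOT 0x1	/* hypothetical flag value */

/* Stand-in for the file system: the only paths that "exist". */
static const char *const toy_paths[] = {
	"/emul/linux/lib/libc.so",
	"/lib/libc.so",
	"/etc/passwd",
};

int
toy_exists(const char *path)
{
	for (size_t i = 0; i < sizeof(toy_paths) / sizeof(toy_paths[0]); i++)
		if (strcmp(toy_paths[i], path) == 0)
			return 1;
	return 0;
}

/*
 * Resolve path into out. Absolute paths are first tried under
 * emul_root when TOY_TRYEMULROOT is set; on a miss the lookup is
 * retried from the normal root. Returns 0 on success, -1 if nothing
 * was found or the candidate does not fit in out.
 */
int
toy_lookup(const char *emul_root, const char *path, int flags,
    char *out, size_t outlen)
{
	int n;

	assert(path != NULL && out != NULL);
	if (emul_root != NULL && (flags & TOY_TRYEMULROOT) && path[0] == '/') {
		n = snprintf(out, outlen, "%s%s", emul_root, path);
		if (n >= 0 && (size_t)n < outlen && toy_exists(out))
			return 0;
	}
	n = snprintf(out, outlen, "%s", path);
	if (n >= 0 && (size_t)n < outlen && toy_exists(out))
		return 0;
	return -1;
}
```

Keeping the fallback inside the lookup routine is what lets callers drop
their own path rewriting and simply pass the flag.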


Revision tags: thorpej-atomic-base
# 1.137 08-Apr-2007 hannken

Remove now obsolete vn_start_write() and vn_finished_write() and
corresponding flags.

Revert softdep_trackbufs() to its state before vn_start_write() was added.

Remove from struct mount now unneeded flags IMNT_SUSPEND* and
members mnt_writeopcountupper, mnt_writeopcountlower and mnt_leaf.

Welcome to 4.99.17


# 1.136 03-Apr-2007 hannken

Remove calls to now obsolete vn_start_write() and vn_finished_write().


# 1.135 09-Mar-2007 ad

branches: 1.135.2; 1.135.4;
- Make the proclist_lock a mutex. The write:read ratio is unfavourable,
and mutexes are cheaper to use than RW locks.
- LOCK_ASSERT -> KASSERT in some places.
- Hold proclist_lock/kernel_lock longer in a couple of places.


# 1.134 04-Mar-2007 christos

Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.133 16-Feb-2007 hannken

branches: 1.133.2;
Make fstrans(9) the default helper for file system suspension.
Replaces the now obsolete vn_start_write()/vn_finished_write().


Revision tags: post-newlock2-merge
# 1.132 09-Feb-2007 ad

Merge newlock2 to head.


Revision tags: newlock2-nbase newlock2-base
# 1.131 19-Jan-2007 hannken

New file system suspension API to replace vn_start_write and vn_finished_write.
The suspension helpers are now put into file system specific operations.
This means every file system not supporting these helpers cannot be suspended
and therefore snapshots are no longer possible.

Implemented for file systems of type ffs.

The new API is enabled on a kernel option NEWVNGATE. This option is
not enabled by default in any kernel config.

Presented and discussed on tech-kern with much input from
Bill Studenmund <wrstuden@netbsd.org> and YAMAMOTO Takashi <yamt@netbsd.org>.

Welcome to 4.99.9 (new vfs op vfs_suspendctl).


# 1.130 30-Dec-2006 elad

Avoid TOCTOU in Veriexec by introducing veriexec_openchk() to enforce
the policy and using a single namei() call in vn_open().


Revision tags: yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 netbsd-4-base
# 1.129 30-Nov-2006 elad

branches: 1.129.2;
Massive restructuring and cleanup of Veriexec, mainly in preparation
for work on some future functionality.

- Veriexec data-structures are no longer exposed.

- Thanks to using proplib for data passing now, the interface
changes further to accommodate that.

Introduce four new functions. First, veriexec_file_add(), to add
a new file to be monitored by Veriexec, to replace both
veriexec_load() and veriexec_hashadd(). veriexec_table_add(), to
replace veriexec_newtable(), will be used to optimize hash table
size (during preload), and finally, veriexec_convert(), to convert
an internal entry to one userland can read.

- Introduce veriexec_unmountchk(), to enforce Veriexec unmount
policy. This cleans up a bit of code in kern/vfs_syscalls.c.

- Rename veriexec_tblfind() to veriexec_table_lookup(), and make
it static. More functions that became static: veriexec_fp_cmp(),
veriexec_fp_calc().

- veriexec_verify() no longer returns the entry as well, but just
sets a boolean indicating whether an entry was found or not.

- veriexec_purge() now takes a struct vnode *.

- veriexec_add_fp_name() was merged into veriexec_add_fp_ops(), that
changed its name to veriexec_fpops_add(). veriexec_find_ops() was
also renamed to veriexec_fpops_lookup().

Also on the fp-ops front, the three function types used to initialize,
update, and finalize a hash context were renamed to
veriexec_fpop_init_t, veriexec_fpop_update_t, and veriexec_fpop_final_t
respectively.

- Introduce a new malloc(9) type, M_VERIEXEC, and use it instead of
M_TEMP, so we can tell exactly how much memory is used by Veriexec.

- And, most importantly, whitespace and indentation nits.

Built successfully for amd64, i386, sparc, and sparc64. Tested on amd64.


# 1.128 01-Nov-2006 elad

printf() -> log().


# 1.127 28-Oct-2006 elad

Adapt to changes suggested by yamt@ to get rid of __UNCONST() stuff.

While here, don't leak pathbuf on success.


# 1.126 27-Oct-2006 elad

Don't allocate MAXPATHLEN on the stack.

Prompted by and initial diff okay yamt@


Revision tags: yamt-splraiseipl-base2
# 1.125 05-Oct-2006 chs

add support for O_DIRECT (I/O directly to application memory,
bypassing any kernel caching for file data).


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.124 12-Sep-2006 elad

branches: 1.124.2;
Fix typo.


# 1.123 10-Sep-2006 blymn

Prevent a veriexec file from being truncated.


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 rpaulo-netinet-merge-pcb-base
# 1.122 26-Jul-2006 dogcow

branches: 1.122.4;
at the request of elad, as veriexec.h has returned, revert the changes
from 2006-07-25.


# 1.121 25-Jul-2006 dogcow

mechanically go through and
s,include "veriexec.h",include <sys/verified_exec.h>,
as the former has apparently gone away.


# 1.120 24-Jul-2006 elad

replace magic numbers for strict levels (0-3) with defines.


# 1.119 24-Jul-2006 elad

finally do things properly. veriexec_report() takes flags, not three ints.


# 1.118 24-Jul-2006 elad

some fixes:
- adapt to NVERIEXEC in init_sysctl.c.
- we now need "veriexec.h" for NVERIEXEC.
- "opt_verified_exec.h" -> "opt_veriexec.h", and include it only where
it is needed.


# 1.117 23-Jul-2006 ad

Use the LWP cached credentials where sane.


# 1.116 22-Jul-2006 elad

kill a VOP_GETATTR() we don't need for veriexec.


# 1.115 22-Jul-2006 elad

deprecate the VERIFIED_EXEC option; now we only need the pseudo-device to
enable it. while here, some config file tweaks.

tons of input from cube@ (thanks!) and okay blymn@.


# 1.114 16-Jul-2006 elad

oops, forgot to commit that one. thanks Arnaud Lacombe.


# 1.113 14-Jul-2006 elad

okay, since there was no way to divide this to two commits, here it goes..

introduce fileassoc(9), a kernel interface for associating meta-data with
files using in-kernel memory. this is very similar to what we had in
veriexec till now, only abstracted so it can be used more easily by more
consumers.

this also prompted the redesign of the interface, making it work on vnodes
and mounts and not directly on devices and inodes. internally, we still
use file-id but that's gonna change soon... the interface will remain
consistent.

as a result, veriexec went under some heavy changes to conform to the new
interface. since we no longer use device numbers to identify file-systems,
the veriexec sysctl stuff changed too: kern.veriexec.count.dev_N is now
kern.veriexec.tableN.* where 'N' is NOT the device number but rather a
way to distinguish several mounts.

also worth noting is the plugging of unmount/delete operations
wrt/fileassoc and veriexec.

tons of input from yamt@, wrstuden@, martin@, and christos@.


Revision tags: yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.112 27-May-2006 simonb

Limit the size of any kernel buffers allocated by the VOP_READDIR
routines to MAXBSIZE.


Revision tags: yamt-pdpolicy-base5
# 1.111 14-May-2006 elad

branches: 1.111.2;
integrate kauth.


# 1.110 14-May-2006 christos

XXX: GCC uninitialized.


Revision tags: elad-kernelauth-base
# 1.109 04-May-2006 perseant

Change VOP_FCNTL to take an unlocked vnode. Approved by wrstuden@.


Revision tags: yamt-pdpolicy-base4 yamt-pdpolicy-base3
# 1.108 24-Mar-2006 hannken

vn_rdwr(): Initialize `mp' to NULL. vn_finished_write() would be called
with uninitialized `mp' if `vp->v_type == VCHR'.

From Coverity CID 2475.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.107 10-Mar-2006 yamt

branches: 1.107.2;
remove a wrong assertion.


Revision tags: yamt-pdpolicy-base
# 1.106 01-Mar-2006 yamt

branches: 1.106.2; 1.106.4;
merge yamt-uio_vmspace branch.

- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.


Revision tags: yamt-uio_vmspace-base5
# 1.105 04-Feb-2006 yamt

vn_read: don't bother to allocate read-ahead context here.
it will be done in uvn_get if necessary.


# 1.104 01-Jan-2006 yamt

branches: 1.104.2; 1.104.4;
vn_lock: LK_CANRECURSE is used by layered filesystems. pointed out by cube@.


# 1.103 31-Dec-2005 yamt

vn_lock: assert that only a limited set of LK_* flags is used.


# 1.102 12-Dec-2005 elad

branches: 1.102.2;
Catch up with ktrace-lwp merge.

While I'm here, stop using cur{lwp,proc}.


# 1.101 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: ktrace-lwp-base
# 1.100 29-Nov-2005 yamt

merge yamt-readahead branch.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-base
# 1.99 08-Nov-2005 hannken

branches: 1.99.2;
vput() -> vrele(). Vnode is already unlocked.
With much help from Pavel Cahyna.

Fixes PR 32005.


Revision tags: yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base
# 1.98 15-Oct-2005 elad

copystr and copyinstr return int, not void.


# 1.97 14-Oct-2005 christos

No need for __UNCONST in previous commit; factor out the function call.


# 1.96 14-Oct-2005 elad

Copy the path to a kernel buffer before using it from ndp, as it may be a
pointer to userspace.


# 1.95 20-Sep-2005 yamt

uninline vn_start_write and vn_finished_write as they are big enough.


# 1.94 23-Jul-2005 erh

Fix a null vp panic when creating a file at veriexec strict level 3.


# 1.93 16-Jul-2005 christos

defopt verified_exec.


# 1.92 19-Jun-2005 elad

branches: 1.92.2;
- Avoid pollution of struct vnode. Save the fingerprint evaluation status
in the veriexec table entry; the lookups are very cheap now. Suggested
by Chuq.

- Handle non-regular (!VREG) files correctly.

- Remove (no longer needed) FINGERPRINT_NOENTRY.


# 1.91 17-Jun-2005 elad

More veriexec changes:

- Better organize strict level. Now we have 4 levels:
- Level 0, learning mode: Warnings only about anything that might've
resulted in 'access denied' or similar in a higher strict level.

- Level 1, IDS mode:
- Deny access on fingerprint mismatch.
- Deny modification of veriexec tables.

- Level 2, IPS mode:
- All implications of strict level 1.
- Deny write access to monitored files.
- Prevent removal of monitored files.
- Enforce access type - 'direct', 'indirect', or 'file'.

- Level 3, lockdown mode:
- All implications of strict level 2.
- Prevent creation of new files.
- Deny access to non-monitored files.

- Update sysctl(3) man-page with above. (date bumped too :)

- Remove FINGERPRINT_INDIRECT from possible fp_status values; it's no
longer needed.

- Simplify veriexec_removechk() in light of new strict level policies.

- Eliminate use of 'securelevel'; veriexec now behaves according to
its strict level only.
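
The strict levels listed above can be sketched as a small decision function. This is only an illustration of the policy as the log message describes it, not the veriexec code: the names `strict_denies`, `enum file_op`, and the `STRICT_*` constants are hypothetical, and the table-modification and access-type rules are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the log message only numbers the levels 0 to 3. */
enum {
    STRICT_LEARNING = 0,    /* warnings only */
    STRICT_IDS      = 1,    /* deny on fingerprint mismatch */
    STRICT_IPS      = 2,    /* also protect monitored files */
    STRICT_LOCKDOWN = 3     /* also deny anything not monitored */
};

enum file_op { OP_READ, OP_WRITE, OP_REMOVE, OP_CREATE };

/*
 * Return true if the operation would be denied at the given level.
 * monitored: the file has a fingerprint entry.
 * fp_match:  the fingerprint matched (only meaningful if monitored).
 */
bool
strict_denies(int level, enum file_op op, bool monitored, bool fp_match)
{
    assert(level >= STRICT_LEARNING && level <= STRICT_LOCKDOWN);

    /* Level 1 and up: deny access on fingerprint mismatch. */
    if (level >= STRICT_IDS && monitored && !fp_match)
        return true;

    /* Level 2 and up: no writes to, or removal of, monitored files. */
    if (level >= STRICT_IPS && monitored &&
        (op == OP_WRITE || op == OP_REMOVE))
        return true;

    /* Level 3: no new files, no access to non-monitored files. */
    if (level >= STRICT_LOCKDOWN && (op == OP_CREATE || !monitored))
        return true;

    /* Level 0 (learning mode) never denies; it would only warn. */
    return false;
}
```

Each level inherits the checks of the one below it, which is why the comparisons use `>=` rather than `==`.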


# 1.90 11-Jun-2005 elad

Work according to veriexec strict level, not securelevel. Also, use the
veriexec_report() routine when possible; and when opening a file for writing,
only invalidate the fingerprint, since the data will not always be changed.


# 1.89 05-Jun-2005 thorpej

Use ANSI function decls.


# 1.88 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.87 20-Apr-2005 blymn

Rototill of the verified exec functionality.
* We now use hash tables instead of a list to store the in kernel
fingerprints.
* Fingerprint methods handling has been made more flexible, it is now
even simpler to add new methods.
* the loader no longer passes in magic numbers representing the
fingerprint method so veriexecctl is no longer kernel specific.
* fingerprint methods can be tailored out using options in the kernel
config file.
* more fingerprint methods added - rmd160, sha256/384/512
* veriexecctl can now report the fingerprint methods supported by the
running kernel.
* regularised the naming of some portions of veriexec.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.86 26-Feb-2005 perry

branches: 1.86.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.85 02-Jan-2005 thorpej

branches: 1.85.2; 1.85.4;
Add the system call and VFS infrastructure for file system extended
attributes.

From FreeBSD.


# 1.84 12-Dec-2004 yamt

vn_lock: #if 0 out an assertion for now. (until PR/27021 is fixed)


Revision tags: kent-audio1-base
# 1.83 30-Nov-2004 christos

Cloning cleanup:
1. make fileops const
2. add 2 new negative errno's to `officially' support the cloning hack:
- EDUPFD (used to overload ENODEV)
- EMOVEFD (used to overload ENXIO)
3. Created an fdclone() function to encapsulate the operations needed for
EMOVEFD, and made all cloners use it.
4. Centralize the local noop/badop fileops functions to:
fnullop_fcntl, fnullop_poll, fnullop_kqfilter, fbadop_stat


# 1.82 06-Nov-2004 christos

Fix another stupid typo.


# 1.81 06-Nov-2004 wrstuden

Add support for FIONWRITE and FIONSPACE ioctls. FIONWRITE reports
the number of bytes in the send queue, and FIONSPACE reports the
number of free bytes in the send queue. These ioctls permit applications
to monitor file descriptor transmission dynamics.

In examining prior art, FIONWRITE exists with the semantics given
here. FIONSPACE is provided so that programs may easily determine how
much space is left in the send queue; they do not need to know the
send queue size.

The fact that a write may block even if there is enough space in the
send queue for it is noted in the documentation.

FIONWRITE functionality may be used to implement TIOCOUTQ for Linux
emulation - Linux extended this ioctl to sockets, even though they are
not ttys.
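
A minimal userland sketch of querying both values, assuming the ioctls take a pointer to `int` and are visible through `<sys/ioctl.h>`. The helper name `send_queue_status` is hypothetical, and the `#if` guard is there only so the sketch still compiles on systems that do not define these requests.

```c
#include <sys/ioctl.h>

/*
 * Hypothetical helper: fetch the number of bytes queued for sending
 * and the free space left in the send queue of fd.
 * Returns 0 on success, -1 if the ioctls fail or are not available.
 */
int
send_queue_status(int fd, int *queued, int *space)
{
#if defined(FIONWRITE) && defined(FIONSPACE)
    if (ioctl(fd, FIONWRITE, queued) == -1)
        return -1;
    if (ioctl(fd, FIONSPACE, space) == -1)
        return -1;
    return 0;
#else
    /* This platform does not define the requests. */
    (void)fd;
    (void)queued;
    (void)space;
    return -1;
#endif
}
```

As the message notes, a write may still block even when the reported free space looks sufficient, so the result is advisory.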


# 1.80 31-May-2004 yamt

vn_lock: add an assertion about usecount.


# 1.79 30-May-2004 yamt

vn_lock: don't pass LK_RETRY to VOP_LOCK.


# 1.78 25-May-2004 hannken

Add ffs internal snapshots. Written by Marshall Kirk McKusick for FreeBSD.

- Not enabled by default. Needs kernel option FFS_SNAPSHOT.
- Change parameters of ffs_blkfree.
- Let the copy-on-write functions return an error so spec_strategy
may fail if the copy-on-write fails.
- Change genfs_*lock*() to use vp->v_vnlock instead of &vp->v_lock.
- Add flag B_METAONLY to VOP_BALLOC to return indirect block buffer.
- Add a function ffs_checkfreefile needed for snapshot creation.
- Add special handling of snapshot files:
Snapshots may not be opened for writing and the attributes are read-only.
Use the mtime as the time this snapshot was taken.
Deny mtime updates for snapshot files.
- Add function transferlockers to transfer any waiting processes from
one lock to another.
- Add vfsop VFS_SNAPSHOT to take a snapshot and make it accessible through
a vnode.
- Add snapshot support to ls, fsck_ffs and dump.

Welcome to 2.0F.

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


Revision tags: netbsd-2-0-3-RELEASE netbsd-2-1-RELEASE netbsd-2-1-RC6 netbsd-2-1-RC5 netbsd-2-1-RC4 netbsd-2-1-RC3 netbsd-2-1-RC2 netbsd-2-1-RC1 netbsd-2-0-2-RELEASE netbsd-2-0-1-RELEASE netbsd-2-base netbsd-2-0-RELEASE netbsd-2-0-RC5 netbsd-2-0-RC4 netbsd-2-0-RC3 netbsd-2-0-RC2 netbsd-2-0-RC1 netbsd-2-0-base
# 1.77 14-Feb-2004 hannken

branches: 1.77.2; 1.77.4; 1.77.6;
Add a generic copy-on-write hook to add/remove functions that will be
called with every buffer written through spec_strategy().

Used by fss(4). Future file-system-internal snapshots will need them too.

Welcome to 1.6ZK

Approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.76 10-Jan-2004 hannken

Allow vfs_write_suspend() to wait if the file system is already
suspending.

Move vfs_write_suspend() and vfs_write_resume() from kern/vfs_vnops.c
to kern/vfs_subr.c.

Change vnode write gating in ufs/ffs/ffs_softdep.c (from FreeBSD).

When vnodes are throttled in softdep_trackbufs() check for
file system suspension every 10 msecs to avoid a deadlock.


# 1.75 15-Oct-2003 hannken

Add the gating of system calls that cause modifications to the underlying
file system.
The function vfs_write_suspend stops all new write operations to a file
system, allows any file system modifying system calls already in progress
to complete, then sync's the file system to disk and returns. The
function vfs_write_resume allows the suspended write operations to
complete.

From FreeBSD with slight modifications.

Approved by: Frank van der Linden <fvdl@netbsd.org>


# 1.74 29-Sep-2003 cb

fix O_NOFOLLOW for non-O_CREAT case.

Reviewed by: christos@ (some time ago)


# 1.73 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.72 29-Jun-2003 fvdl

branches: 1.72.2;
Back out the lwp/ktrace changes. They contained a lot of collateral damage,
and need to be examined and discussed more.


# 1.71 29-Jun-2003 thorpej

Adjust to ktrace/lwp changes.


# 1.70 28-Jun-2003 darrenr

Pass lwp pointers throughout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.

Bump the kernel rev up to 1.6V


# 1.69 03-Apr-2003 fvdl

Copy birthtime in vn_stat.


# 1.68 21-Mar-2003 dsl

Use 'void *' instead of 'caddr_t' in prototypes of VOP_IOCTL, VOP_FCNTL
and VOP_ADVLOCK, delete casts from callers (and some to copyin/out).


# 1.67 21-Mar-2003 dsl

Change 'data' argument to fo_ioctl and fo_fcntl from 'caddr_t' to 'void *'.
Avoids a lot of casting and removes the need for some line breaks.
Removed a load of (caddr_t) casts from calls to copyin/copyout as well.
(approved by christos - he has a plan to remove caddr_t...)


# 1.66 17-Mar-2003 jdolecek

make it possible for UNION fs to be loaded via LKM - instead of
having some #ifdef UNION code in vfs_vnops.c, introduce variable
'vn_union_readdir_hook' which is set to address of appropriate
vn_readdir() hook by union filesystem when it's loaded & mounted


# 1.65 16-Mar-2003 jdolecek

move union filesystem code from sys/miscfs/union to sys/fs/union


# 1.64 03-Mar-2003 jdolecek

only pull in/declare veriexec related stuff with VERIFIED_EXEC


# 1.63 24-Feb-2003 perseant

Allow filesystems' VOP_IOCTL to catch ioctl calls on directories and
regular files. Approved thorpej, fvdl.


# 1.62 01-Feb-2003 atatat

Check for (and deny) negative values passed to FIOGETBMAP.


# 1.61 24-Jan-2003 fvdl

Bump daddr_t to 64 bits. Replace it with int32_t in all places where
it was used on-disk, so that on-disk formats remain the same.
Remove ufs_daddr_t and ufs_lbn_t for the time being.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base
# 1.60 11-Dec-2002 atatat

Provide a ioctl called FIOGETBMAP (there are some who call
it...FIBMAP) that translates a logical block number to a physical
block number from the underlying device. Via VOP_BMAP().


# 1.59 06-Dec-2002 christos

s/NOSYMLINK/O_NOFOLLOW/


# 1.58 29-Oct-2002 blymn

Added support for fingerprinted executables aka verified exec


Revision tags: kqueue-aftermerge
# 1.57 23-Oct-2002 jdolecek

merge kqueue branch into -current

kqueue provides a stateful and efficient event notification framework.
currently supported events include socket, file, directory, fifo,
pipe, tty and device changes, and monitoring of processes and signals.

kqueue is supported by all writable filesystems in NetBSD tree
(with exception of Coda) and all device drivers supporting poll(2)

based on work done by Jonathan Lemon for FreeBSD
initial NetBSD port done by Luke Mewburn and Jason Thorpe


Revision tags: kqueue-beforemerge
# 1.56 14-Oct-2002 gmcgarry

vn_stat() can now take a struct vnode * for consistency. Hide away
the opaque file descriptor operations.


# 1.55 05-Oct-2002 chs

count executable image pages as executable for vm-usage purposes.
also, always do the VTEXT vs. v_writecount mutual exclusion
(which we previously skipped if the text or data segment was empty).


Revision tags: netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base gehenna-devsw-base eeh-devprop-base kqueue-base
# 1.54 17-Mar-2002 atatat

branches: 1.54.6;
Convert ioctl code to use EPASSTHROUGH instead of -1 or ENOTTY for
indicating an unhandled "command". ERESTART is -1, which can lead to
confusion. ERESTART has been moved to -3 and EPASSTHROUGH has been
placed at -4. No ioctl code should now return -1 anywhere. The
ioctl() system call is now properly restartable.
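
The convention can be sketched as follows. The two numeric values are the ones quoted in the message; everything else (`lower_ioctl`, `top_ioctl`, `CMD_GET_SIZE`, the `SKETCH_` prefix) is hypothetical, and the final translation to ENOTTY at the top level is an assumption about how an unhandled command reaches the caller.

```c
#include <errno.h>

/* Kernel-internal values as quoted in the log message. */
#define SKETCH_ERESTART     (-3)
#define SKETCH_EPASSTHROUGH (-4)

#define CMD_GET_SIZE 1UL    /* hypothetical command number */

/* Hypothetical lower-level handler: it knows exactly one command. */
int
lower_ioctl(unsigned long cmd, int *data)
{
    switch (cmd) {
    case CMD_GET_SIZE:
        *data = 512;
        return 0;
    default:
        /* Not ours: say so without using -1 or ENOTTY. */
        return SKETCH_EPASSTHROUGH;
    }
}

/*
 * Hypothetical top level: if no layer claimed the command, report
 * ENOTTY to the caller (assumed behaviour for this sketch).
 */
int
top_ioctl(unsigned long cmd, int *data)
{
    int error = lower_ioctl(cmd, data);

    if (error == SKETCH_EPASSTHROUGH)
        error = ENOTTY;
    return error;
}
```

Keeping "unhandled" distinct from both ERESTART and a real ENOTTY is what lets a caller tell "nobody handled this" apart from "restart the system call".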


Revision tags: newlock-base ifpoll-base
# 1.53 09-Dec-2001 chs

replace "vnode" and "vtext" with "file" and "exec" in uvmexp field names.


Revision tags: thorpej-mips-cache-base
# 1.52 12-Nov-2001 lukem

add RCSIDs


# 1.51 30-Oct-2001 thorpej

- Add a new vnode flag VEXECMAP, which indicates that a vnode has
executable mappings. Stop overloading VTEXT for this purpose (VTEXT
also has another meaning).
- Rename vn_marktext() to vn_markexec(), and use it when executable
mappings of a vnode are established.
- In places where we want to set VTEXT, set it in v_flag directly, rather
than making a function call to do this (it no longer makes sense to
use a function call, since we no longer overload VTEXT with VEXECMAP's
meaning).

VEXECMAP suggested by Chuq Silvers.


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2
# 1.50 21-Sep-2001 chs

branches: 1.50.2;
use shared locks instead of exclusive for VOP_READ() and VOP_READDIR().


Revision tags: post-chs-ubcperf
# 1.49 15-Sep-2001 chs

a whole bunch of changes to improve performance and robustness under load:

- remove special treatment of pager_map mappings in pmaps. this is
required now, since I've removed the globals that expose the address range.
pager_map now uses pmap_kenter_pa() instead of pmap_enter(), so there's
no longer any need to special-case it.
- eliminate struct uvm_vnode by moving its fields into struct vnode.
- rewrite the pageout path. the pager is now responsible for handling the
high-level requests instead of only getting control after a bunch of work
has already been done on its behalf. this will allow us to UBCify LFS,
which needs tighter control over its pages than other filesystems do.
writing a page to disk no longer requires making it read-only, which
allows us to write wired pages without causing all kinds of havoc.
- use a new PG_PAGEOUT flag to indicate that a page should be freed
on behalf of the pagedaemon when it's unlocked. this flag is very similar
to PG_RELEASED, but unlike PG_RELEASED, PG_PAGEOUT can be cleared if the
pageout fails due to eg. an indirect-block buffer being locked.
this allows us to remove the "version" field from struct vm_page,
and together with shrinking "loan_count" from 32 bits to 16,
struct vm_page is now 4 bytes smaller.
- no longer use PG_RELEASED for swap-backed pages. if the page is busy
because it's being paged out, we can't release the swap slot to be
reallocated until that write is complete, but unlike with vnodes we
don't keep a count of in-progress writes so there's no good way to
know when the write is done. instead, when we need to free a busy
swap-backed page, just sleep until we can get it busy ourselves.
- implement a fast-path for extending writes which allows us to avoid
zeroing new pages. this substantially reduces cpu usage.
- encapsulate the data used by the genfs code in a struct genfs_node,
which must be the first element of the filesystem-specific vnode data
for filesystems which use genfs_{get,put}pages().
- eliminate many of the UVM pagerops, since they aren't needed anymore
now that the pager "put" operation is a higher-level operation.
- enhance the genfs code to allow NFS to use the genfs_{get,put}pages
instead of a modified copy.
- clean up struct vnode by removing all the fields that used to be used by
the vfs_cluster.c code (which we don't use anymore with UBC).
- remove kmem_object and mb_object since they were useless.
instead of allocating pages to these objects, we now just allocate
pages with no object. such pages are mapped in the kernel until they
are freed, so we can use the mapping to find the page to free it.
this allows us to remove splvm() protection in several places.

The sum of all these changes improves write throughput on my
decstation 5000/200 to within 1% of the rate of NetBSD 1.5
and reduces the elapsed time for "make release" of a NetBSD 1.5
source tree on my 128MB pc to 10% less than a 1.5 kernel took.


Revision tags: pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.48 09-Apr-2001 jdolecek

branches: 1.48.2; 1.48.4;
Change the first arg to fileops fo_stat routine to struct file *, adjust
callers and appropriate routines to cope. This makes fo_stat more
consistent with rest of fileops routines and also makes the fo_stat
match FreeBSD as an added bonus.
Discussed with Luke Mewburn on tech-kern@.


# 1.47 07-Apr-2001 jdolecek

Add new 'stat' fileop and call the stat function via f_ops rather
than directly.
For compat syscalls, also add necessary FILE_USE()/FILE_UNUSE().
Now that soo_stat() gets a proc arg, pass it on to usrreq function.


# 1.46 09-Mar-2001 chs

add UBC memory-usage balancing. we track the number of pages in use for
each of the basic types (anonymous data, executable image, cached files)
and prevent the pagedaemon from reusing a given page if that would reduce
the count of that type of page below a sysctl-settable minimum threshold.
the thresholds are controlled via three new sysctl tunables:
vm.anonmin, vm.vnodemin, and vm.vtextmin. these tunables are the
percentages of pageable memory reserved for each usage, and we do not allow
the sum of the minimums to be more than 95% so that there's always some
memory that can be reused.
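
A rough sketch of the two rules described above: the 95% cap on the sum of the minimums, and the refusal to reclaim a page when that would drop its type below its minimum. The struct and function names are hypothetical, and the arithmetic is an illustration of the description rather than the pagedaemon's actual accounting.

```c
#include <stdbool.h>

/* Hypothetical holder for the three tunables, in percent. */
struct min_pct {
    int anonmin;    /* vm.anonmin */
    int vnodemin;   /* vm.vnodemin */
    int vtextmin;   /* vm.vtextmin */
};

/* Accept a setting only if the minimums sum to at most 95%. */
bool
min_pct_valid(const struct min_pct *m)
{
    if (m->anonmin < 0 || m->vnodemin < 0 || m->vtextmin < 0)
        return false;
    return m->anonmin + m->vnodemin + m->vtextmin <= 95;
}

/*
 * May one page of a given type be reused?  Taking it must not leave
 * fewer than min_pct percent of pageable memory for that type.
 */
bool
may_reclaim(unsigned long type_pages, unsigned long pageable_pages,
    int min_pct)
{
    if (type_pages == 0)
        return false;
    return (type_pages - 1) * 100 >= pageable_pages * (unsigned long)min_pct;
}
```

With 100 pageable pages and a 10% minimum, a type holding 11 pages may give one up, while a type holding exactly 10 may not.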


# 1.45 27-Nov-2000 chs

branches: 1.45.2;
Initial integration of the Unified Buffer Cache project.


# 1.44 12-Aug-2000 sommerfeld

Use ltsleep(...,PNORELOCK..) instead of simple_unlock()/tsleep()


# 1.43 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-PATCH003 netbsd-1-5-PATCH002 netbsd-1-5-PATCH001 netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.42 11-Apr-2000 chs

add a new function vn_marktext() for exec code to let others know
that the vnode is now being used as process text.


# 1.41 30-Mar-2000 augustss

Get rid of register declarations.


# 1.40 30-Mar-2000 simonb

Delete redundant decl of union_vnodeop_p, it's in <miscfs/union/union.h>.


# 1.39 14-Feb-2000 fvdl

Fixes to the softdep code from Ethan Solomita <ethan@geocast.com>.
* Fix buffer ordering when it has dependencies.
* Alleviate memory problems.
* Deal with some recursive vnode locks (sigh).
* Fix other bugs.


Revision tags: chs-ubc2-newbase wrstuden-devbsize-19991221 wrstuden-devbsize-base comdex-fall-1999-base fvdl-softdep-base
# 1.38 31-Aug-1999 bouyer

branches: 1.38.2;
Add a new flag, used by vn_open(), which prevents symlinks from being followed
at open time. Use this to prevent coredumps from following symlinks when the
kernel opens/creates the file.


# 1.37 03-Aug-1999 wrstuden

Add support for fcntl(2) to generate VOP_FCNTL calls. Any fcntl
call with F_FSCTL set and F_SETFL calls generate calls to a new
fileop fo_fcntl. Add genfs_fcntl() and soo_fcntl() which return 0
for F_SETFL and EOPNOTSUPP otherwise. Have all leaf filesystems
use genfs_fcntl().

Reviewed by: thorpej
Tested by: wrstuden


Revision tags: kame_141_19991130 netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base
# 1.36 31-Mar-1999 mycroft

branches: 1.36.2; 1.36.4;
Previous change to vn_lock() was bogus. If we got EDEADLK, it was from
lockmgr(), and it already unlocked v_interlock. So, just return in this case.


# 1.35 30-Mar-1999 wrstuden

The mode for a node is a mode_t in both struct stat and struct vattr -
don't use a u_short for intermediate storage in vn_stat.


# 1.34 25-Mar-1999 sommerfe

Prevent deadlock cited in PR4629 from crashing the system. (copyout
and system call now just return EFAULT). A complete fix will
presumably have to wait for UBC and/or for vnode locking protocols to
be revamped to allow use of shared locks.


# 1.33 24-Mar-1999 mrg

completely remove Mach VM support. all that is left is the header
files, as UVM still uses (most of) these.


# 1.32 26-Feb-1999 wrstuden

Modify VOP_CLOSE vnode op to always take a locked vnode. Change vn_close
to pass down a locked node. Modify union_copyup() to call VOP_CLOSE with
locked nodes.

Also fix a bug in union_copyup() where a lock on the lower vnode would
only be released if VOP_OPEN didn't fail.


Revision tags: kenh-if-detach-base chs-ubc-base
# 1.31 02-Aug-1998 kleink

branches: 1.31.2;
Implement support for IEEE Std 1003.1b-1993 synchronous I/O:
* if synchronized I/O file integrity completion of read operations was
requested, set IO_SYNC in the ioflag passed to the read vnode operator.
* if synchronized I/O data integrity completion of write operations was
requested, set IO_DSYNC in the ioflag passed to the write vnode operator.


Revision tags: eeh-paddr_t-base
# 1.30 28-Jul-1998 thorpej

Change the "aresid" argument of vn_rdwr() from an int * to a size_t *,
to match the new uio_resid type.


# 1.29 30-Jun-1998 thorpej

Add two additional arguments to the fileops read and write calls, a
pointer to the offset to use, and a flags word. Define a flag that
specifies whether or not to update the offset passed by reference.


# 1.28 01-Mar-1998 fvdl

Merge with Lite2 + local changes


# 1.27 19-Feb-1998 thorpej

Include the UNION option header.


# 1.26 10-Feb-1998 mrg

- add defopt's for UVM, UVMHIST and PMAP_NEW.
- remove unnecessary UVMHIST_DECL's.


# 1.25 05-Feb-1998 mrg

initial import of the new virtual memory system, UVM, into -current.

UVM was written by chuck cranor <chuck@maria.wustl.edu>, with some
minor portions derived from the old Mach code. i provided some help
getting swap and paging working, and other bug fixes/ideas. chuck
silvers <chuq@chuq.com> also provided some other fixes.

this is the rest of the MI portion changes.

this will be KNF'd shortly. :-)


# 1.24 14-Jan-1998 thorpej

Grab a fix from 4.4BSD-Lite2: open(2) with O_FSYNC and MNT_SYNCHRONOUS
had no effect. Fix: check for either of these flags in vn_write(),
and pass IO_SYNC down if they're set.


Revision tags: netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base marc-pcmcia-base
# 1.23 10-Oct-1997 fvdl

branches: 1.23.2;
Add vn_readdir function for use in both the old getdirentries and
the new getdents(). Add getdents().


Revision tags: thorpej-signal-base marc-pcmcia-bp
# 1.22 24-Mar-1997 mycroft

branches: 1.22.4;
Do not return generation counts to the user.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.21 07-Sep-1996 mycroft

Implement poll(2).


Revision tags: netbsd-1-2-PATCH001 netbsd-1-2-RELEASE netbsd-1-2-BETA netbsd-1-2-base
# 1.20 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.19 23-May-1995 mycroft

Remove gratuitous extra indirections.


# 1.18 14-Dec-1994 mycroft

Remove extra arg to vn_open().


# 1.17 13-Dec-1994 mycroft

LEASE_CHECK -> VOP_LEASE


# 1.16 14-Nov-1994 christos

added extra argument in vn_open and VOP_OPEN to allow cloning devices


# 1.15 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.14 18-Sep-1994 mycroft

Fix space change in last commit.


# 1.13 14-Sep-1994 cgd

from Kirk McKusick: release old ctty if acquiring a new one.
also: prettiness police!


Revision tags: netbsd-1-0-base
# 1.12 29-Jun-1994 cgd

branches: 1.12.2;
New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.11 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.10 17-May-1994 cgd

copyright foo


# 1.9 25-Apr-1994 cgd

some prototype cleanup, eliminate/replace bogus types (e.g. quad and
u_quad) -> use better types (e.g. quad_t & u_quad_t in inodes),
some cleanup.


# 1.8 12-Apr-1994 chopps

FIONREAD returns int not off_t. (ssize_t preferred, but standards may
dictate otherwise)


# 1.7 21-Dec-1993 cgd

kill two wrong 'case's


# 1.6 18-Dec-1993 mycroft

Canonicalize all #includes.


Revision tags: magnum-base
# 1.5 07-Sep-1993 ws

branches: 1.5.2;
Changes to VFS readdir semantics
NFS changes for better cookie support
ISOFS changes for better Rockridge support and support for generation numbers


# 1.4 24-Aug-1993 pk

Support added for proc filesystem.


Revision tags: netbsd-0-9-patch-001 netbsd-0-9-RELEASE netbsd-0-9-BETA netbsd-0-9-ALPHA2 netbsd-0-9-ALPHA netbsd-0-9-base
# 1.3 22-May-1993 cgd

add include of select.h if necessary for protos, or delete if extraneous


# 1.2 18-May-1993 cgd

make kernel select interface be one-stop shopping & clean it all up.


# 1.1 21-Mar-1993 cgd

branches: 1.1.1;
Initial revision