History log of /freebsd-10-stable/sys/nfsclient/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
317525 27-Apr-2017 rmacklem

MFC: r316792
Add an NFSv4.1 mount option for "use one openowner".

Some NFSv4.1 servers such as AmazonEFS can only support a small fixed number
of open_owner4s. This patch adds a mount option called "oneopenown" that
can be used for NFSv4.1 mounts to make the client do all Opens with the
same open_owner4 string. This option can only be used with NFSv4.1 and
may not work correctly when Delegations are is use.

Differential Revision: https://reviews.freebsd.org/D8988

276500 01-Jan-2015 kib

MFC r275897:
Set NOCACHE flag for CREATE namei() calls, do not specially handle
MAKEENTRY in VOP_LOOKUP().

260107 30-Dec-2013 rmacklem

MFC: r259084
For software builds, the NFS client does many small
synchronous (with FILE_SYNC) writes because non-contiguous
byte ranges in the same buffer cache block are being
written. This patch adds a new mount option "noncontigwr"
which allows the non-contiguous byte ranges to be combined,
with the dirty byte range becoming the superset of the bytes
that are dirty, if the file has not been file locked.
This reduces the number of writes significantly for software
builds. The only case where this change might break existing
applications is where an application is writing
non-overlapping byte ranges within the same buffer cache block
of a file from multiple clients concurrently.
Since such an application would normally do file locking on
the file, avoiding the byte range merge for files that have
been file locked should be sufficient for most (maybe all?) cases.

256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


252673 04-Jul-2013 rmacklem

A problem with the old NFS client where large writes to large files
would sometimes result in a corrupted file was reported via email.
This problem appears to have been caused by r251719 (reverting
r251719 fixed the problem). Although I have not been able to
reproduce this problem, I suspect it is caused by another thread
increasing np->n_size after the mtx_unlock(&np->n_mtx) but before
the vnode_pager_setsize() call. Since the np->n_mtx mutex serializes
updates to np->n_size, doing the vnode_pager_setsize() with the
mutex locked appears to avoid the problem.
Unfortunately, vnode_pager_setsize() where the new size is smaller,
cannot be called with a mutex held.
This patch returns the semantics to be close to pre-r251719 such that the
call to the vnode_pager_setsize() is only delayed until after the mutex is
unlocked when np->n_size is shrinking. Since the file is growing
when being written, I believe this will fix the corruption.

Reported by: David G. Lawrence (dg@dglawrence.com)
Tested by: David G. Lawrence (pending, to happen soon)
Reviewed by: kib
MFC after: 1 week


252479 01-Jul-2013 rmacklem

A recent version of the oldnfs NFS client in head/current
will crash when doing a large write, since m_get2() would
return NULL. This patch fixes the problem, since nfsm_uiotombuf()
will allocate additional mbufs, as required.

Reported by: sbruno
Tested by: sbruno
Discussed with: glebius


251171 31-May-2013 jeff

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


251089 29-May-2013 rmacklem

Add a patch analygous to r248567, r248581, r251079 to the
old NFS client to avoid the panic reported in the PR by
doing the vnode_pager_setsize() call after unlocking the mutex.

PR: 177335
MFC after: 2 weeks


249630 18-Apr-2013 rmacklem

When an NFS unmount occurs, once vflush() writes the last dirty
buffer for the last vnode on the mount back to the server, it
returns. At that point, the code continues with the unmount,
including freeing up the nfs specific part of the mount structure.
It is possible that an nfsiod thread will try to check for an
empty I/O queue in the nfs specific part of the mount structure
after it has been free'd by the unmount. This patch avoids this problem by
setting the iodmount entries for the mount back to NULL while holding the
mutex in the unmount and checking the appropriate entry is non-NULL after
acquiring the mutex in the nfsiod thread.

Reported and tested by: pho
Reviewed by: kib
MFC after: 2 weeks


249623 18-Apr-2013 rmacklem

Both NFS clients can deadlock when using the "rdirplus" mount
option. This can occur when an nfsiod thread that already holds
a buffer lock attempts to acquire a vnode lock on an entry in
the directory (a LOR) when another thread holding the vnode lock
is waiting on an nfsiod thread. This patch avoids the deadlock by disabling
readahead for this case, so the nfsiod threads never do readdirplus.
Since readaheads for directories need the directory offset cookie
from the previous read, they cannot normally happen in parallel.
As such, testing by jhb@ and myself didn't find any performance
degredation when this patch is applied. If there is a case where
this results in a significant performance degradation, mounting
without the "rdirplus" option can be done to re-enable readahead
for directories.

Reported and tested by: jhb
Reviewed by: jhb
MFC after: 2 weeks


248500 19-Mar-2013 emaste

Fix remainder calculation when biosize is not a power of 2

In common configurations biosize is a power of two, but is not required to
be so. Thanks to markj@ for spotting an additional case beyond my original
patch.

Reviewed by: rmacklem@


248255 13-Mar-2013 jhb

Revert 195703 and 195821 as this special stop handling in NFS is now
implemented via VFCF_SBDRY rather than passing PBDRY to individual
sleep calls.


248207 12-Mar-2013 glebius

Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.


248198 12-Mar-2013 glebius

- Use m_get2() instead of nfsm_reqhead().
- Use m_get(), m_getcl() instead of historic macros.

Sponsored by: Nginx, Inc.


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


247116 21-Feb-2013 jhb

Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep(). In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing. Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag. Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed. For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
the request as being on behalf of the thread performing the lookup
(cnp_thread) rather than using a NULL thread pointer. This causes
NFS to properly handle signals during this VOP on an interruptible
mount.

PR: kern/176179
Reported by: Russell Cattelan (sigdeferstop() recursion)
Reviewed by: kib
MFC after: 1 month


246417 06-Feb-2013 jhb

Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary. However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly. Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks). The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland. This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by: kib
MFC after: 1 month


245909 25-Jan-2013 jhb

Further cleanups to use of timestamps in NFS:
- Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds
portion of wall-time stamps to manage timeouts on events.
- Remove unused nd_starttime from the per-request structure in the new
NFS server.
- Use nanotime() for the modification time on a delegation to get as
precise a time as possible.
- Use time_second instead of extracting the second from a call to
getmicrotime().

Submitted by: bde (3)
Reviewed by: bde, rmacklem
MFC after: 2 weeks


245611 18-Jan-2013 jhb

Use vfs_timestamp() to set file timestamps rather than invoking
getmicrotime() or getnanotime() directly in NFS.

Reviewed by: rmacklem, bde
MFC after: 1 week


245508 16-Jan-2013 jhb

Use the VA_UTIMES_NULL flag to detect when NULL was passed to utimes()
instead of comparing the desired time against the current time as a
heuristic.

Reviewed by: rmacklem
MFC after: 1 week


245476 15-Jan-2013 jhb

- More properly handle interrupted NFS requests on an interruptible mount
by returning an error of EINTR rather than EACCES.
- While here, bring back some (but not all) of the NFS RPC statistics lost
when krpc was committed.

Reviewed by: rmacklem
MFC after: 1 week


244042 08-Dec-2012 rmacklem

Move the NFSv4.1 client patches over from projects/nfsv4.1-client
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.


243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


243311 19-Nov-2012 attilio

r16312 is not any longer real since many years (likely since when VFS
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.

Removes comments that makes no sense now.

Reviewed by: kib
MFC after: 3 days


242833 09-Nov-2012 attilio

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


239246 14-Aug-2012 kib

Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


239065 05-Aug-2012 kib

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


239040 04-Aug-2012 kib

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


235332 12-May-2012 rmacklem

PR# 165923 reported intermittent write failures for dirty
memory mapped pages being written back on an NFS mount.
Since any thread can call VOP_PUTPAGES() to write back a
dirty page, the credentials of that thread may not have
write access to the file on an NFS server. (Often the uid
is 0, which may be mapped to "nobody" in the NFS server.)
Although there is no completely correct fix for this
(NFS servers check access on every write RPC instead of at
open/mmap time), this patch avoids the common cases by
holding onto a credential that recently opened the file
for writing and uses that credential for the write RPCs
being done by VOP_PUTPAGES() for both NFS clients.

Tested by: Joel Ray Holveck (joelh at juniper.net)
PR: kern/165923
Reviewed by: kib
MFC after: 2 weeks


235246 10-May-2012 mckusick

Fix mount mutex handling missed in r234386.


235052 05-May-2012 pluknet

Fix mount mutex handling missed in r234386.


234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


234386 17-Apr-2012 mckusick

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


232821 11-Mar-2012 kib

Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by: gianni


232420 03-Mar-2012 rmacklem

Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks


232327 01-Mar-2012 rmacklem

Fix the NFS clients so that they use copyin() instead of bcopy(),
when doing direct I/O. This direct I/O code is not enabled by default.

Submitted by: kib (earlier version)
Reviewed by: kib
MFC after: 1 week


232116 24-Feb-2012 jhb

Adjust the nfs_skip_wcc_data_onerr setting so that it does not block
post-op attributes for ENOENT errors now that the name caching logic
depends on working post-op attributes.

MFC after: 2 weeks


231949 21-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


231852 17-Feb-2012 bz

Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:

Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.

This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.

Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days


231088 06-Feb-2012 jhb

Rename cache_lookup_times() to cache_lookup() and retire the old API and
ABI stub for cache_lookup().


231075 06-Feb-2012 kib

Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


230803 31-Jan-2012 rmacklem

When a "mount -u" switches an NFS mount point from TCP to UDP,
any thread doing an I/O RPC with a transfer size greater than
NFS_UDPMAXDATA will be hung indefinitely, retrying the RPC.
After a discussion on freebsd-fs@, I decided to add a warning
message for this case, as suggested by Jeremy Chadwick.

Suggested by: freebsd at jdc.parodius.com (Jeremy Chadwick)
MFC after: 2 weeks


230605 27-Jan-2012 rmacklem

A problem with respect to data read through the buffer cache for both
NFS clients was reported to freebsd-fs@ under the subject "NFS
corruption in recent HEAD" on Nov. 26, 2011. This problem occurred when
a TCP mounted root fs was changed to using UDP. I believe that this
problem was caused by the change in mnt_stat.f_iosize that occurred
because rsize was decreased to the maximum supported by UDP. This
patch fixes the problem by using v_bufobj.bo_bsize instead of f_iosize,
since the latter is set to f_iosize when the vnode is allocated, but
does not change for a given vnode when f_iosize changes.

Reported by: pjd
Reviewed by: kib
MFC after: 2 weeks


230559 26-Jan-2012 rmacklem

Revert r230516, since it doesn't really fix the problem.


230552 25-Jan-2012 kib

Fix remaining calls to cache_enter() in both NFS clients to provide
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.

Reviewed by: jhb (previous version)
MFC after: 2 weeks


230547 25-Jan-2012 jhb

Add a timeout on positive name cache entries in the NFS client. That is,
we will only trust a positive name cache entry for a specified amount of
time before falling back to a LOOKUP RPC, even if the ctime for the file
handle matches the cached copy in the name cache entry. The timeout is
configured via a new 'nametimeo' mount option and defaults to 60 seconds.
It may be set to zero to disable positive name caching entirely.

Reviewed by: rmacklem
MFC after: 1 week


230516 25-Jan-2012 rmacklem

If a mount -u is done to either NFS client that switches it
from TCP to UDP and the rsize/wsize/readdirsize is greater
than NFS_MAXDGRAMDATA, it is possible for a thread doing an
I/O RPC to get stuck repeatedly doing retries. This happens
because the RPC will use a resize/wsize/readdirsize that won't
work for UDP and, as such, it will keep failing indefinitely.
This patch returns an error for this case, to avoid the problem.
A discussion on freebsd-fs@ seemed to indicate that returning
an error was preferable to silently ignoring the "udp"/"mntudp"
option.
This problem was discovered while investigating a problem reported
by pjd@ via email.

MFC after: 2 weeks


230394 20-Jan-2012 jhb

Close a race in NFS lookup processing that could result in stale name cache
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.

MFC after: 2 weeks


230249 17-Jan-2012 mckusick

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


228757 21-Dec-2011 rmacklem

jwd@ reported a problem via email where the old NFS client would
get a reply of EEXIST from an NFS server when a Mkdir RPC was retried,
for an NFS over UDP mount.
Upon investigation, it was found that the client was retransmitting
the Mkdir RPC request over UDP, but with a different xid. As such,
the retransmitted message would miss the Duplicate Request Cache
in the server, causing it to reply EEXIST. The kernel client side
UDP rpc code has two timers. The first one causes a retransmit using
the same xid and socket and was set to a fixed value of 3seconds.
(The default can be overridden via CLSET_RETRY_TIMEOUT.)
The second one creates a new socket and xid and should be larger
than the first. However, both NFS clients were setting the second
timer to nm_timeo ("timeout=<value>" mount argument), which defaulted to
1second, so the first timer would never time out.
This patch fixes both NFS clients so that they set the first timer
using nm_timeo and makes the second timer larger than the first one.

Reported by: jwd
Tested by: jwd
Reviewed by: jhb
MFC after: 2 weeks


228156 30-Nov-2011 kib

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


227690 19-Nov-2011 rmacklem

The old NFS client will crash due to the reply being m_freem()'d
twice if the server bogusly returns an error with the NFSERR_RETERR
bit (bit 31) set. No actual NFS error has this bit set, but it seems
that amd will sometimes do this. This patch makes sure the NFSERR_RETERR
bit is cleared to avoid a crash.

PR: kern/153847
MFC after: 2 weeks


227507 14-Nov-2011 jhb

Finish making 'wcommitsize' an NFS client mount option.

Reviewed by: rmacklem
MFC after: 1 week


224733 09-Aug-2011 jhb

Merge 220876, 220877, and 221537 from the new NFS client to the old:
Allow the NFS client to use a max file size larger than 1TB for v3 mounts.
It now allows files up to OFF_MAX subject to whatever limit the server
advertises.

Reviewed by: rmacklem
Approved by: re (kib)
MFC after: 1 week


224604 02-Aug-2011 rmacklem

Fix a LOR in the NFS client which could cause a deadlock.
This was reported to the mailing list freebsd-net@freebsd.org
on July 21, 2011 under the subject "LOR with nfsclient sillyrename".
The LOR occurred when nfs_inactive() called vrele(sp->s_dvp)
while holding the vnode lock on the file in s_dvp. This patch
modifies the client so that it performs the vrele(sp->s_dvp)
as a separate task to avoid the LOR. This fix was discussed
with jhb@ and kib@, who both proposed variations of it.

Tested by: pho, jlott at averesystems.com
Submitted by: jhb (earlier version)
Reviewed by: kib
Approved by: re (kib)
MFC after: 2 weeks


223309 19-Jun-2011 rmacklem

Fix the kgssapi so that it can be loaded as a module. Currently
the NFS subsystems use five of the rpcsec_gss/kgssapi entry points,
but since it was not obvious which others might be useful, all
nineteen were included. Basically the nineteen entry points are
set in a structure called rpc_gss_entries and inline functions
defined in sys/rpc/rpcsec_gss.h check for the entry points being
non-NULL and then call them. A default value is returned otherwise.
Requested by rwatson.

Reviewed by: jhb
MFC after: 2 weeks


222586 01-Jun-2011 kib

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


222464 29-May-2011 rmacklem

Add a check for MNTK_UNMOUNTF at the beginning of nfs_sync()
in the old NFS client so that a forced dismount doesn't
get stuck in the VFS_SYNC() call that happens before
VFS_UNMOUNT() in dounmount(). Analagous to r222329 for the new NFS client.
An additional change is needed before forced dismounts will work.

PR: kern/157365
MFC after: 2 weeks


222187 22-May-2011 alc

Eliminate duplicate #include's.


222075 18-May-2011 rmacklem

Add a sanity check for the existence of an "addr" option
to both NFS clients. This avoids the crash reported by
Sergey Kandaurov (pluknet@gmail.com) to the freebsd-fs@
list with subject "[old nfsclient] different nmount()
args passed from mount vs mount_nfs" dated May 17, 2011.

Tested by: pluknet at gmail.com (old nfs client)
MFC after: 2 weeks


221986 16-May-2011 rmacklem

Fix a comment that got missed by r221973 which changed
the sysctl naming for the old NFS client to vfs.oldnfs.


221973 15-May-2011 rmacklem

Change the sysctl naming for the old and new NFS clients
to vfs.oldnfs.xxx and vfs.nfs.xxx respectively. This makes
the default nfs client use vfs.nfs.xxx after r221124.


221543 06-May-2011 rmacklem

Move sys/nfsclient/nfs_kdtrace.h to sys/nfs/nfs_kdtrace.h so
it can be used by the new NFS client as well as the old one.


221542 06-May-2011 rmacklem

Fix the module dependency in nfs_kdtrace.c for the old NFS
client. This should fix a problem reported by Marcus Reid.


221436 04-May-2011 ru

Implemented a mount option "nocto" that disables cache coherency
checking at open time. It may improve performance for read-only
NFS mounts. Use deliberately.

MFC after: 1 week
Reviewed by: rmacklem, jhb (earlier version)


221139 27-Apr-2011 rmacklem

Fix module names and dependencies so the NFS clients will
load correctly as modules after r221124.


221124 27-Apr-2011 rmacklem

This patch changes head so that the default NFS client is now the new
NFS client (which I guess is no longer experimental). The fstype "newnfs"
is now "nfs" and the regular/old NFS client is now fstype "oldnfs".
Although mounts via fstype "nfs" will usually work without userland
changes, an updated mount_nfs(8) binary is needed for kernels built with
"options NFSCL" but not "options NFSCLIENT". Updated mount_nfs(8) and
mount(8) binaries are needed to do mounts for fstype "oldnfs".
The GENERIC kernel configs have been changed to use options
NFSCL and NFSD (the new client and server) instead of NFSCLIENT and NFSSERVER.
For kernels being used on diskless NFS root systems, "options NFSCL"
must be in the kernel config.
Discussed on freebsd-fs@.


221066 26-Apr-2011 rmacklem

Fix a kernel linking problem introduced by r221032, r221040
when building kernels that don't have "options NFS_ROOT"
specified. I plan on moving the functions that use these
data structures into the shared code in sys/nfs/nfs_diskless.c
in a future commit. At that time, these definitions will no
longer be needed in nfs_vfsops.c and nfs_clvfsops.c.

MFC after: 2 weeks


221032 25-Apr-2011 rmacklem

Fix the experimental NFS client so that it does not bogusly
set the f_flags field of "struct statfs". This had the interesting
effect of making the NFSv4 mounts "disappear" after r221014,
since NFSMNT_NFSV4 and MNT_IGNORE became the same bit.
Move the files used for a diskless NFS root from sys/nfsclient
to sys/nfs in preparation for them to be used by both NFS
clients. Also, move the declaration of the three global data
structures from sys/nfsclient/nfs_vfsops.c to sys/nfs/nfs_diskless.c
so that they are defined when either client uses them.

Reviewed by: jhb
MFC after: 2 weeks


221014 25-Apr-2011 rmacklem

Modify the experimental NFS client so that it uses the same
"struct nfs_args" as the regular NFS client. This is needed
so that the old mount(2) syscall will work and it makes
sharing of the diskless NFS root code easier. Eary in the
porting exercise I introduced a new revision of nfs_args, but
didn't actually need it, thanks to nmount(2). I re-introduced the
NFSMNT_KERB flag, since it does essentially the same thing and
the old one would not have been used because it never worked.
I also added a few new NFSMNT_xxx flags to sys/nfsclient/nfs_args.h
that are used by the experimental NFS client.

MFC after: 2 weeks


220595 13-Apr-2011 ru

- Fixed nfs_printf() to use vprintf().
- Fixed vfs.nfs.acdebug sysctl's description.
- Fixed panic when compiled with NFS_ACDEBUG.

MFC after: 3 days


219028 25-Feb-2011 netchild

Add some FEATURE macros for various features (AUDIT/CAM/IPC/KTR/MAC/NFS/NTP/
PMC/SYSV/...).

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: arch@ (parts by rwatson, trasz, jhb)
X-MFC after: to be determined in last commit with code from this project


218757 16-Feb-2011 bz

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


216931 03-Jan-2011 rmacklem

Fix the nlm so that it no longer depends on the regular
nfs client and, as such, can be loaded for the experimental
nfs client without the regular client.

Reviewed by: jhb
MFC after: 2 weeks


215548 19-Nov-2010 kib

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


214418 27-Oct-2010 jh

Add missing "readahead" to the nfs_opts list.

PR: 151321
Tested by: Simon Walton
MFC after: 2 weeks


214053 19-Oct-2010 rmacklem

Fix the type of the 3rd argument for nm_getinfo so that it works
for architectures like sparc64.

Suggested by: kib
MFC after: 2 weeks


214048 19-Oct-2010 rmacklem

Modify the NFS clients and the NLM so that the NLM can be used
by both clients. Since the NLM uses various fields of the
nfsmount structure, those fields were extracted and put in a
separate nfs_mountcommon structure stored in sys/nfs/nfs_mountcommon.h.
This structure also has a function pointer for a function that
extracts the required information from the mount point and nfs vnode
for that particular client, for information stored differently by the
clients.

Reviewed by: jhb
MFC after: 2 weeks


214026 18-Oct-2010 kib

Do not synchronously start the nfsiod threads at all. The r212506
fixed the issues with file descriptor locks, but the same problems are
present for vnode lock/user map lock.

If the nfs_asyncio() cannot find the free nfsiod, schedule task to
create new nfsiod and return error. This causes fall back to the
synchronous i/o for nfs_strategy(), or does not start read at all in
the case of readahead. The caller that holds vnode and potentially
user map lock does not wait for kproc_create() to finish, preventing
the LORs.

The change effectively reverts r203072, because we never hand off the
request to newly created nfsiod thread anymore.

Reviewed by: jhb
Tested by: jhb, pluknet
MFC after: 3 weeks


212506 12-Sep-2010 kib

Do not fork nfsiod directly from the vop methods. This causes LORs between
vnode lock and several locks needed during fork, like fd lock.

Instead, schedule the task to be executed in the taskqueue context. We
still waiting for the fork to finish, but the context of the thread
executing the task does not make real LORs with our vnode lock.

Submitted by: pluknet at gmail com
Reviewed by: jhb
Tested by: pho
MFC after: 3 weeks


212293 07-Sep-2010 jhb

Store the full timestamp when caching timestamps of files and
directories for purposes of validating name cache entries. This
closes races where two updates to a file or directory within the same
second could result in stale entries in the name cache. While here,
remove the 'n_expiry' field as it is no longer used.

Reviewed by: rmacklem
MFC after: 1 week


212123 01-Sep-2010 rmacklem

Modify nfs_diskless.c to recognize the environment variable
boot.nfsroot.nfshandlelen and set the diskless root fs to
use NFSv3 and this file handle length when it is set. If
this environment variable is not set, the diskless root fs
will use NFSv2 and the same defaults as before. This fixes
the problem where the diskless nfs root fs had to be on a
FreeBSD server for NFSv3 to work, because it did not know
the correct file handle length and assumed the size used
by FreeBSD. Until pxeboot and loader are replaced by ones
built from commits coming soon, boot.nfsroot.nfshandlelen
will not be set by them and the diskless root fs will use
NFSv2 unless the /etc/fstab entry has the "nfsv3" option
specified.

Tested by: danny at cs.huji.ac.il
MFC after: 2 weeks


211531 20-Aug-2010 jhb

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


210834 04-Aug-2010 rmacklem

Add some mutex locking on the nfsnode to the regular NFS client.

Reviewed by: jhb


210455 24-Jul-2010 rmacklem

Move sys/nfsclient/nfs_lock.c into sys/nfs and build it as a separate
module that can be used by both the regular and experimental nfs
clients. This fixes the problem reported by jh@ where /dev/nfslock
would be registered twice when both nfs clients were used.
I also defined the size of the lm_fh field to be the correct value,
as it should be the maximum size of an NFSv3 file handle.

Reviewed by: jh
MFC after: 2 weeks


210136 15-Jul-2010 jhb

Retire the NFS access cache timestamp structure. It was used in VOP_OPEN()
to avoid sending multiple ACCESS/GETATTR RPCs during a single open()
between VOP_LOOKUP() and VOP_OPEN(). Now we always send the RPC in
VOP_LOOKUP() and not VOP_OPEN() in the cases that multiple RPCs could be
sent.

MFC after: 2 weeks


209948 12-Jul-2010 jhb

A previous change moved the GETATTR RPC for open() calls that hit in the
name cache up into nfs_lookup() instead of nfs_open(). Continue this
trend by flushing the attribute cache for leaf nodes in nfs_lookup() during
an open() if we do a LOOKUP RPC. For NFSv3 this should generally be a NOP
as the attributes are flushed before fetching the post-op attributes from
the LOOKUP RPC which most (all?) NFSv3 servers provide, so the post-op
attributes should populate the cache.

Now all NFS open() calls will always clear the cached attributes during the
nfs_lookup() prior to nfs_open() in the !NMODIFIED case to provide CTOC.
As a result, we can remove the conditional flushing of the attribute
cache from nfs_open().

Reviewed by: rmacklem, bde
MFC after: 2 weeks


209946 12-Jul-2010 jhb

- Add missing locking around flushing of an NFS node's attribute cache
in the NMODIFIED case of nfs_open().
- Cosmetic tweak to simplify an expression in nfs_lookup().

Reviewed by: rmacklem, bde
MFC after: 1 week


209120 13-Jun-2010 kib

In NFS clients, instead of inconsistently using #ifdef
DIAGNOSTIC and #ifndef DIAGNOSTIC for debug assertions, prefer
KASSERT(). Also change one #ifdef DIAGNOSTIC in the new nfs server.

Submitted by: Mikolaj Golub <to.my.trociny gmail com>
MFC after: 2 weeks


208605 27-May-2010 delphij

Fix build: newnp represents newvp so KDTRACE_NFS_ATTRCACHE_FLUSH_DONE()
on newvp instead of vp here.


208603 27-May-2010 jhb

More gracefully handle stale file handles and attributes when opening a
file via NFS. Specifically, to satisfy close-to-open-consistency, the NFS
client always performs at least one RPC on a file during an open(2) to see
if the file has changed. Normally this RPC is an ACCESS or GETATTR RPC
that is forced by flushing a file's attribute cache during nfs_open() and
then requesting new attributes. However, if the file is noticed to be
stale during nfs_open(), the only recourse is to fail the open(2) call
with ESTALE. On the other hand, if the ACCESS or GETATTR RPC is sent
during nfs_lookup(), then the NFS client can fall back to a LOOKUP RPC to
obtain the new file handle in the case that a file has been replaced.

This change causes the NFS client to flush the attribute cache during
nfs_lookup() when validating a name cache hit if the attributes fetched
during nfs_lookup() can be reused in nfs_open(). This allows the client
to open a replaced file via the new file handle the first time that it
notices a replaced file rather than failing with ESTALE in some cases.

Reviewed by: rmacklem, bde
Reviewed by: mohans (older version)
MFC after: 1 week


208586 27-May-2010 cperciva

Change the current working directory to be inside the jail created by
the jail(8) command. [10:04]

Fix a one-NUL-byte buffer overflow in libopie. [10:05]

Correctly sanity-check a buffer length in nfs mount. [10:06]

Approved by: so (cperciva)
Approved by: re (kensmith)
Security: FreeBSD-SA-10:04.jail
Security: FreeBSD-SA-10:05.opie
Security: FreeBSD-SA-10:06.nfsclient


207746 07-May-2010 alc

Push down the page queues lock into vm_page_activate().


207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


207662 05-May-2010 trasz

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


207584 03-May-2010 kib

Lock the page around vm_page_activate() and vm_page_deactivate() calls
where it was missed. The wrapped fragments now protect wire_count with
page lock.

Reviewed by: alc


204063 18-Feb-2010 pjd

Simplify code a bit.


203968 16-Feb-2010 marius

Factor out the code shared between NFS client and server into its own
module. With r203732 it became apparent that creating the sysctl nodes
twice causes at least a warning, however the whole code shouldn't be
present twice in the first place.

Discussed with: rmacklem


203732 09-Feb-2010 marius

- Move nfs_realign() from the NFS client to the shared NFS code and
remove the NFS server version in order to reduce code duplication.
The shared version now uses a second parameter how, which is passed
on to m_get(9) and m_getcl(9) as the server used M_WAIT while the
client requires M_DONTWAIT, and replaces the the previously unused
parameter hsiz.
- Change nfs_realign() to use nfsm_aligned() so as with other NFS code
the alignment check isn't actually performed on platforms without
strict alignment requirements for performance reasons because as the
comment suggests unaligned data only occasionally occurs with TCP.
- Change fha_extract_info() to use nfs_realign() with M_DONTWAIT rather
than M_WAIT because it's called with the RPC sp_lock held.

Reviewed by: jhb, rmacklem
MFC after: 1 week


203731 09-Feb-2010 marius

Some style(9) fixes


203072 27-Jan-2010 rmacklem

Fix a race that can occur when nfs nfsiod threads are being created.
Without this patch it was possible for a different thread that calls
nfs_asyncio() to snitch a newly created nfsiod thread that was
intended for another caller of nfs_asyncio(), because the nfs_iod_mtx
mutex was unlocked while the new nfsiod thread was created. This patch
labels the newly created nfsiod, so that it is not taken by another
caller of nfs_asyncio(). This is believed to fix the problem reported
on the freebsd-stable email list under the subject:
FreeBSD NFS client/Linux NFS server issue.

Tested by: to DOT my DOT trociny AT gmail DOT com
Reviewed by: jhb
MFC after: 2 weeks


202774 21-Jan-2010 rmacklem

Fix a typo in a comment introduced by r202767.

MFC after: 2 weeks


202767 21-Jan-2010 rmacklem

Add a timeout for the negative name cache entries in the NFS client.
This avoids a bogus negative name cache entry from persisting forever
when another client creates an entry with the same name within the
same NFS server time of day clock tick. The mount option negnametimeo
can be used to override the default timeout interval on a
per-mount-point basis. Setting negnametimeo to 0 disables negative
name caching for the mount point.
I also fixed one obvious typo where args.timeo should be
args.maxgrouplist.

Submitted by: jhb (earlier version)
Reviewed by: jhb
MFC after: 2 weeks


201895 09-Jan-2010 zec

Reduce recursions on curvnet and thus spamming the console with warning
messages for kernels built with options VIMAGE and VNET_DEBUG enabled.

Reviewed by: bz
MFC after: 3 days


201758 07-Jan-2010 mbr

Remove extraneous semicolons, no functional changes.

Submitted by: Marc Balmer <marc@msys.ch>
MFC after: 1 week


201044 27-Dec-2009 bz

Add missing include to make LINT-VIMAGE build as well.

Found by: test building an MFC candidate
X-MFC with: r200471


200471 13-Dec-2009 bz

Add a few more V_hacks to nfsclient to allow machines with a VIMAGE
kernel to boot from NFS. [1]

Note: this is not a full virtualization of nfsclient. It is only does
what advertised above and nothing more.

Requested by: public demand [1]
Tested by: kris, ..
MFC after: 5 days


198174 16-Oct-2009 jhb

Close a race with caching of -ve name lookups in the NFS client.
Specifically, clients only trust -ve cache entries while the directory
remains unchanged and discard any -ve cache entries for a directory when
they notice that the modification time of a directory entry changes. The
race involves two concurrent lookups as follows:
- Thread A does a lookup for file 'foo' which sends a lookup RPC to the
server. The lookup fails and the server replies.
- The 'foo' file is created (either by the same client or a different
client) updating the modification time on the parent directory of 'foo'.
- Thread B does a lookup for a different file 'bar' which updates the
cached attributes of the parent directory of 'foo' to reflect the new
modification time after 'foo' was created.
- Thread A finally resumes execution to parse the reply from the NFS
server. It adds a -ve cache entry and sets the cached value of the
directory's modification time that is used for invalidating -ve cached
lookups to the new modification time set by thread B.

At this point, future lookups of 'foo' will honor the -ve cached entry
until the cached entry is pushed out of the name cache's LRU or the
modification time of the parent directory is changed again by some other
change. The fix is to read the directory's modification time before
sending the lookup RPC and use that cached modification time when setting
the directory's cached modification time. Also, we do not add a -ve cache
entry if another thread has added -ve cache entry that set the directory's
cached modification time to a newer value than the value we read before
sending the lookup RPC.

Reviewed by: rmacklem
MFC after: 1 week


197997 12-Oct-2009 rwatson

Add a MODULE_DEPEND() on the NFS client from dtnfsclient so that dtnfsclient
can access NFS client symbols.

MFC after: 3 days
Discussed with: kib
Reported by: markm


197235 15-Sep-2009 qingli

Reverting the previous change for now. Some users reports the patch
fixes their issues but one reports a failure in NFS ROOT. Revert
the change for now pending further investigation.

Reviewed by: bz
MFC after: immediately


197212 15-Sep-2009 qingli

Simply remove the code instead of using "#if 0".

Pointed out by sam


197210 15-Sep-2009 qingli

The bootp code installs an interface address and the nfs client
module tries to install the same address again. This extra code
is removed, which was discovered by the removal of a call to
in_ifscrub() in r196714. This call to in_ifscrub is put back here
because the SIOCAIFADDR command can be used to change the prefix
length of an existing alias.

Reviewed by: kmacy


197048 09-Sep-2009 rmacklem

Add LK_NOWITNESS to the vn_lock() calls done on newly created nfs
vnodes, since these nodes are not linked into the mount queue and,
as such, the vn_lock() cannot cause a deadlock so LORs are harmless.

Suggested by: kib
Approved by: kib (mentor)
MFC after: 3 days


196503 24-Aug-2009 zec

Fix NFS panics with options VIMAGE kernels by apropriately setting curvnet
context inside the RPC code.

Temporarily set td's cred to mount's cred before calling socreate() via
__rpc_nconf2socket().

Submitted by: rmacklem (in part)
Reviewed by: rmacklem, rwatson
Discussed with: dfr, bz
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days


196481 23-Aug-2009 rwatson

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian
MFC after: 3 days


196205 14-Aug-2009 kib

In nfs_upgrade_vnlock(), assert that the vnode is locked. It is for all
pathes, as far as I see and testing seems to confirm it. Comparision of
old_lock with LK_SHARED make sense only if vnode is locked by current
thread.

When downgrading, pass LK_RETRY to the vn_lock(), since otherwise
vn_lock() unlocks the doomed vnode, causing extra unlock.

Reported and tested by: pho
Approved by: re (rwatson)
MFC after: 3 weeks


196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


195744 17-Jul-2009 rmacklem

Patch the regular nfs client in a manner analagous to
r195704 for the experimental client. The patch avoids calling vn_lock()
for the case where nfs_nget() has acquired the same vnode as dvp,
since nfs_nget() has already locked the vnode.

Reviewed by: kib, jhb
Approved by: re (kensmith), kib (mentor)


195703 14-Jul-2009 kib

Use PBDRY flag for msleep(9) in NFS and NLM when sleeping thread owns
kernel resources that block other threads, like vnode locks. The SIGSTOP
sent to such thread (process, rather) shall not stop it until thread
releases the resources.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)


195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


195294 02-Jul-2009 kib

In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks


195203 30-Jun-2009 dfr

Adjust the internal NFS KPI to avoid the last traces of NFS_LEGACYRPC.

Approved by: re


195202 30-Jun-2009 dfr

Remove the old kernel RPC implementation and the NFS_LEGACYRPC option.

Approved by: re


195181 30-Jun-2009 jhb

Fix build with NFS_LEGACYRPC enabled after the socket upcall locking
changes.

Approved by: re (kensmith)


194951 25-Jun-2009 rwatson

Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.

MFC after: 6 weeks
Reviewed by: bz


194739 23-Jun-2009 bz

After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.


194425 18-Jun-2009 alc

Fix some of the style errors in *getpages().


194358 17-Jun-2009 kib

For dotdot lookup in nfs_lookup, inline the vn_vget_ino() to prevent
operating on the unmounted mount point and freed mount data in case of
forced unmount performed while dvp is unlocked to nget the target vnode.

Add missed calls to m_freem(mrep) there on error exits [1].

Submitted by: rmacklem [1]
Tested by: pho
MFC after: 2 weeks


194118 13-Jun-2009 jamie

Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by: bz (mentor)


193952 10-Jun-2009 rmacklem

Add a test for VI_DOOMED just after nfs_upgrade_vnlock() in
nfs_bioread_check_cons(). This is required since it is possible
for the vnode to be vgonel()'d while in nfs_upgrade_vnlock() when
a forced dismount is in progress. Also, move the check for VI_DOOMED
in nfs_vinvalbuf() down to after nfs_upgrade_vnlock() and replace the
out of date comment for it.

Submitted by: jhb
Tested by: pho
Approved by: kib (mentor)
MFC after: 1 month


193744 08-Jun-2009 bz

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


193272 01-Jun-2009 jhb

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


193232 01-Jun-2009 bz

Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by: julian, rwatson, zec
X-MFC: not possible


193187 31-May-2009 alc

nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean(). Nothing uses it any longer.


193066 29-May-2009 jamie

Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by: bz (mentor)


192986 28-May-2009 alc

Make *getpages()s' assertion on the state of each page's dirty bits
stricter.


192686 24-May-2009 dfr

Make sure we feed 32bit align memory to nfsm_dissect otherwise we will fault
on platforms with strict alignment requirements. In particular, this fixes the
problems with the new RPC transport on the arm platform.

Note: this adds yet another copy of nfs_realign(). I will attempt to refactor
after NFS_LEGACYRPC is removed.

Submitted by: sam


192645 23-May-2009 bz

While r192615 fixed the former problems, make this file VIMAGE
compliant now as well initializing local context variables.


192615 23-May-2009 bz

It seems this file was ignored by MRT, rnh locking changes and new-arpv2.
So let the V_irtualization people finally make the disabled debugging code
compile again.

MFC after: 2 weeks
X-MFC: MRT and adapt rnh locking


192578 22-May-2009 rwatson

Remove the unmaintained University of Michigan NFSv4 client from 8.x
prior to 8.0-RELEASE. Rick Macklem's new and more feature-rich NFSv234
client and server are replacing it.

Discussed with: rmacklem


192134 15-May-2009 alc

Eliminate unnecessary clearing of the page's dirty mask from various
getpages functions.

Eliminate a stale comment.


192010 12-May-2009 alc

Eliminate gratuitous clearing of the page's dirty mask.


191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


191964 10-May-2009 alc

Eliminate stale comments.

Eliminate a case of unnecessary page queues locking.


191816 05-May-2009 zec

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


191777 04-May-2009 rwatson

Remove redundant NFSMNT_NFSV3 check in DTrace hooks for NFS RPC.

MFC after: 1 month


191776 04-May-2009 rwatson

Fix typo in comment.

MFC after: 1 month


191013 13-Apr-2009 kib

Remove trailing spaces


190888 10-Apr-2009 rwatson

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


190887 10-Apr-2009 kib

Cache_lookup() for DOTDOT drops dvp vnode lock, allowing dvp to be reclaimed.
Check the condition and return ENOENT then.

In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.

Reported and tested by: pho


190785 06-Apr-2009 jhb

When a stale file handle is encountered, purge all cached information about
an NFS node including the access and attribute caches. Previously the NFS
client only purged any name cache entries associated with the file.

PR: kern/123755
Submitted by: Jaakko Heinonen jh of saunalahti fi
Reported by: Timo Sirainen tss of iki fi
Reviewed by: rwatson, rmacklem
MFC after: 1 month


190783 06-Apr-2009 jhb

Change the default timeout for caching attributes of a directory in the NFS
client from 30 seconds to 3 seconds. After the recent changes to add
caching of negative name cache lookups, a negative name cache hit will
persist until the client notices the parent directory has changed. The
higher the attribute cache timeout on directories, the longer that can take,
so lower the default timeout for directories to match that of regular files.

Suggested by: bde, mohans
MFC after: 1 month


190419 25-Mar-2009 rwatson

Move dtnfsclient.c in the cddl tree to nfs_kdtrace.c in the nfsclient
directory, since it's under a BSD license, and this keeps NFS internals-
aware tracing parts close to NFS.

MFC after: 1 month
Suggested by: jhb


190396 24-Mar-2009 rwatson

Fix two bugs in DTrace tracing of accesscache and attrcache load events:

- Trace non-error loads into the access cache once, not zero times or
twice.
- Sometimes attr cache loads fail due to a race, in which case they are
aborted leading to an invalidation; in this case, trace only the flush,
not a load.

MFC after: 1 month
Sponsored by: Google, Inc.


190380 24-Mar-2009 rwatson

Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

nfsclient:accesscache:flush:done
nfsclient:accesscache:get:hit
nfsclient:accesscache:get:miss
nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

nfsclient:attrcache:flush:done
nfsclient:attrcache:get:hit
nfsclient:attrcache:get:miss
nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after: 1 month
Sponsored by: Google, Inc.


190293 22-Mar-2009 rwatson

Add dtnfsclient, a first cut at an NFSv2/v3 client reuest DTrace
provider. The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.

Probes are named in the following way:

...
nfsclient:nfs2:write:start
nfsclient:nfs2:write:done
...
nfsclient:nfs3:access:start
nfsclient:nfs3:access:done
...

Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice. Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.

Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc. More detailed RPC information might
best be provided by adding a krpc provider. It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.

Sponsored by: Google, Inc.
MFC after: 1 month


190220 21-Mar-2009 rwatson

In nfs_request(), always exit using the nfsmout label once we're
definitely doing an NFSv2 or NFSv3 RPC, rather than sometimes doing
so and sometimes not. This makes it easier to add a DTrace return
probe at a single point in the function.

MFC after: 1 week


190176 20-Mar-2009 jhb

Expand the per-node access cache to cache permissions for multiple users.
The number of entries in the cache defaults to 8 but is easily changed in
nfsclient/nfs.h. When the cache is filled, the oldest cache entry is
evicted when space is needed.

I mirrored the changes to the NFSv[23] client in the NFSv4 client to fix
compile breakage. However, the NFSv4 client doesn't actually use the
access cache currently.

Submitted by: rmacklem


189639 10-Mar-2009 jhb

- Remove code to set SAVENAME for CREATE or RENAME requests that get a -ve
hit in the name cache. cache_lookup() doesn't actually return ENOENT
for such requests to force the filesystem to do an explicit lookup, so
this was effectively dead code.
- Grab the nfsnode mutex while writing to n_dmtime. We don't grab the lock
when comparing the time against the cached directory mod time (just as
we don't when comparing ctime's for +ve name cache hits) since the
attribute caching is already racy for NFS clients as it is.

Discussed with: bde


189106 27-Feb-2009 bz

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


188994 24-Feb-2009 jhb

Bring back the code to prime the ACCESS cache when fetching attributes for
an NFS file. Now the priming is conditional on a new
vfs.nfs.prime_access_cache sysctl. For now I've left the default setting
to disabling the priming.

Requested by: scottl


188833 19-Feb-2009 jhb

Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans


188832 19-Feb-2009 jhb

When fetching attributes for a file for NFSv3 mounts, do not perform an
opportunistic ACCESS RPC to populate both the access and attribute caches
of the file and instead always use a GETATTR RPC. On many modern NFS
servers, an ACCESS RPC is much more expensive to service than a GETATTR
RPC.

Submitted by: mohans


188831 19-Feb-2009 jhb

Don't clear the attribute cache of a file when it is closed. A subsequent
open() of the same file will load fresh attributes, so they do not need to
be explicitly flushed in close() to guarantee close to open consistency.
However, other file desciptors may still reference this file and clearing
the attributes in close() forces those other file descriptors to fetch
fresh attributes the next time they need them.

Reviewed by: mohans
MFC after: 1 week


188751 18-Feb-2009 jhb

Reindent a small bit of code that was not 8-space indented like the rest
of the nfs_lookup() function.


187830 28-Jan-2009 ed

Last step of splitting up minor and unit numbers: remove minor().

Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.


187812 28-Jan-2009 rodrigc

Fix parsing of acregmin, acregmax, acdirmin and acdirmax NFS mount options
when passed as strings via nmount().

Submitted by: Jaakko Heinonen <jh saunalahti fi>


187526 21-Jan-2009 jhb

Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by: ups
Tested by: pho


185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


184969 14-Nov-2008 dfr

Switch the default rpc implementation for NFS back to the new code. I believe
I have fixed the reported problems - if you still have trouble with it, please
contact me with as much detail as possible so that I can track down any other
issues as quickly as possible.


184920 13-Nov-2008 dfr

Temporarily switch NFS back to the old RPC code while I try to diagnose and
fix the problems a few people have noticed with the new code. People who want
to continue testing the new code or who need RPCSEC_GSS support should use
the new option NFS_NEWRPC to select it.


184588 03-Nov-2008 dfr

Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by: Isilon Systems
MFC after: 1 month


184561 02-Nov-2008 trhodes

Document a few sysctls in the NFS client and server code.
Minor style(9) where applicable.

Approved by: alfred (slightly older version)


184554 02-Nov-2008 attilio

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


183330 24-Sep-2008 jhb

Part 1 of making shared lookups more resilient with respect to forced
unmounts. When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.

Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.

Tested by: pho


183215 20-Sep-2008 kib

fdescfs, devfs, mqueuefs, nfs, portalfs, pseudofs, tmpfs and xfs
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.

Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month


183005 13-Sep-2008 rodrigc

Add code to parse NFS mount options passed as individual
items of the nmount() iovec. This will allow us to move
away from gathering up all the NFS mount options as a single
"struct nfs_args" to be passed down through nmount().
This will make adding new NFS mount options much easier.
Many, many thanks to Doug Rabson, who took my initial patches and
cleaned them up.

Reviewed by: dfr
MFC after: 3 months


182542 31-Aug-2008 attilio

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


180780 24-Jul-2008 dfr

Try again not to use a userspace pointer in the kernel when trying to record
the hostname which we need for NLM requests. The previous patch was incomplete.

PR: 125849
Pointy hat: dfr


180779 24-Jul-2008 dfr

Don't use a userspace pointer in the kernel when trying to record the hostname
which we need for NLM requests.

PR: 125849


180724 22-Jul-2008 ed

Move the NFS/RPC code away from lbolt.

The kernel has a special wchan called `lbolt', which is triggered each
second. It doesn't seem to be used a lot and it seems pretty redundant,
because we can specify a timeout value to the *sleep() routines. In an
attempt to eventually remove lbolt, make the NFS/RPC code use a timeout
of `hz' when trying to reconnect.

Only the TTY code (not MPSAFE TTY) and the VFS syncer seem to use lbolt
now.

Reviewed by: attilio, jhb
Approved by: philip (mentor), alfred, dfr


180291 05-Jul-2008 rwatson

Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables. Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout(). A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after: 3 weeks


180025 26-Jun-2008 dfr

Re-implement the client side of rpc.lockd in the kernel. This implementation
provides the correct semantics for flock(2) style locks which are used by the
lockf(1) command line tool and the pidfile(3) library. It also implements
recovery from server restarts and ensures that dirty cache blocks are written
to the server before obtaining locks (allowing multiple clients to use file
locking to safely share data).

Sponsored by: Isilon Systems
PR: 94256
MFC after: 2 weeks


179333 27-May-2008 attilio

Once the ENOLCK is detected we expect to retry the acquisition.
Anyway, in the edge case the flushing happens and the while is no more
executed, nfs_flush() (and nfs4_flush()) can return with a wrong
err value of ENOLCK.
Bring it back to 0, as we expect to have for that case.

Reported by: kris
Reviewed by: kib


179039 16-May-2008 benno

Allow the block size used when booting over NFS to be overridden. It defaults
to 8192 bytes which is the size currently used.


178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


178429 22-Apr-2008 phk

Now that all platforms use genclock, shuffle things around slightly
for better structure.

Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.

In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such. All that
is theoretically a matter for userland only.

Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC. For this we have <sys/clock.h>

<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on. These know only seconds and fractions thereof.

Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.

Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c. Remove references to it
elsewhere.

Remove a lot of unnecessary <sys/clock.h> includes.

Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.


178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


177599 25-Mar-2008 ru

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


177493 22-Mar-2008 jeff

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


176824 05-Mar-2008 rodrigc

Expand the nfs_opts array to include all possible string
mount options that mount_nfs could pass down, if it passed
down string mount options. Right now, mount_nfs jut passes
down a single mount option named "nfs_args" with a fully
initialized 'struct nfs_args'.

In future commits, we will add code to the kernel for parsing stringified
NFS mount options, so that we can convert mount_nfs to pass string options
from userspace to kernel, instead of an initialized struct nfs_args.


176823 05-Mar-2008 rodrigc

In nfs_mount(), default initialize struct nfs_args
the same way that it is default initialized in revision 1.77 of mount_nfs.c.

Right now, this is a no-op, because currently we initialize
struct nfs_args in mount_nfs in userspace, and pass it
down into the kernel via nmount(), so we overwrite whatever we initialize
here with the value passed in from userspace.

However, this lays the groundwork for moving away from passing
struct nfs_args from userspace to kernel via nmount(), so that we
can instead pass string mount options via nmount() which can be parsed in
the kernel. This will make it easier to add new NFS mount options.


176559 25-Feb-2008 attilio

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


176374 17-Feb-2008 yar

Prevent the NFS client from losing MNT_ROOTFS on the root
file system. In particular, stop overwriting mount point
flags in nfs_mountdiskless() because now they are set
elsewhere. (They were _initialized_ by that function in
the 4.4BSD days, when mount structures were not allocated
in a centralized manner -- see rev. 1.1 of this file.)

Fix nfs_mount(), which happened to depend on the loss of
MNT_ROOTFS when it came to update handling.

Also note that mountnfs() no longer handles updates. Now
they shouldn't reach this function, so printf a diagnostic
message if that happens due to a coding error.


176249 13-Feb-2008 attilio

- Add real assertions to lockmgr locking primitives.
A couple of notes for this:
* WITNESS support, when enabled, is only used for shared locks in order
to avoid problems with the "disowned" locks
* KA_HELD and KA_UNHELD only exists in the lockmgr namespace in order
to assert for a generic thread (not curthread) owning or not the
lock. Really, this kind of check is bogus but it seems very
widespread in the consumers code. So, for the moment, we cater this
untrusted behaviour, until the consumers are not fixed and the
options could be removed (hopefully during 8.0-CURRENT lifecycle)
* Implementing KA_HELD and KA_UNHELD (not surported natively by
WITNESS) made necessary the introduction of LA_MASKASSERT which
specifies the range for default lock assertion flags
* About other aspects, lockmgr_assert() follows exactly what other
locking primitives offer about this operation.

- Build real assertions for buffer cache locks on the top of
lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp)
paradigm.

- Add checks at lock destruction time and use a cookie for verifying
lock integrity at any operation.

- Redefine BUF_LOCKFREE() in order to not use a direct assert but
let it rely on the aforementioned destruction time check.

KPI results evidently broken, so __FreeBSD_version bumping and
manpage update result necessary and will be committed soon.

Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.

Tested by: kris (earlier version)
Reviewed by: jhb


176224 13-Feb-2008 jhb

Consolidate the code to generate a new XID for a NFS request into a
nfs_xid_gen() function instead of duplicating the logic in both
nfsm_rpchead() and the NFS3ERR_JUKEBOX handling in nfs_request().

MFC after: 1 week
Submitted by: mohans (a long while ago)


176198 11-Feb-2008 kris

Switch the default NFS mount mode from UDP to TCP. UDP mounts are a
historical relic, and are no longer appropriate for either LAN or WAN
mounting. At modern (gigabit and 10 gigabit) LAN speeds packet loss
from socket buffer fill events is common, and sequence numbers wrap
quickly enough that data corruption is possible. TCP solves both of
these problems without imposing significant overhead.

MFC after: 1 month


176134 09-Feb-2008 attilio

namei() can call underlying nfs_readlink() passing a struct uio pointer
owned by a NULL owner. This will lead consequent VOP_ISLOCKED() present
into nfs_upgrade_vnlock() to panic as it only acquire curthread now.
Fix nfs_upgrade_vnlock() and nfs_downgrade_vnlock() in order to not use
more the struct thread pointer passed as argument (as it is really nomore
required there as vn_lock() and VOP_UNLOCK doesn't get the lock more).
Using curthread, in place, doesn't get ambiguity as LK_EXCLOTHER should
be handled as a "not locked" request by both functions.

Reported by: kris
Tested by: kris
Reviewed by: ups


176116 08-Feb-2008 attilio

Conver all explicit instances to VOP_ISLOCKED(arg, NULL) into
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.

Reviewed by: jeff, kib


175635 24-Jan-2008 attilio

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


175486 19-Jan-2008 attilio

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


175237 11-Jan-2008 jhb

The previous revision broke the case of reconnecting to a TCP NFS server
via a new socket during an NFS operation as that reconnect takes place in
the context of an arbitrary thread with an arbitrary credential. Ideally
we would like to use the mount point's credential for the entire process
of setting up the socket to connect to the NFS server. Since some of the
APIs (sobind(), etc.) only take a thread pointer and infer the credential
from that instead of a direct credential, work around the problem by
temporarily changing the current thread's credential to that of the mount
point while connecting the socket and then reverting back to the original
credential when we are done.

Reviewed by: rwatson
Tested on: UDP, TCP, TCP with forced reconnect


175221 10-Jan-2008 jhb

Pass curthread to various socket routines (socreate(), sobind(), and
soconnect()) instead of &thread0 when establishing a connection to the NFS
server. Otherwise inconsistent credentials may be used when setting up
the NFS socket.

MFC after: 1 week
Reviewed by: rwatson


175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


173751 19-Nov-2007 rwatson

Remove hacks from the NFSv2/3 client intended to handle a lack of a
server-side RPC retranmission cache for non-idempotent operations: these
hacks substituted 0 (success) for the expected EEXIST in the event that
a target name already existed for LINK, SYMLINK, and MKDIR operations,
under the assumption that EEXIST represented a second application of the
original RPC rather than a true failure.

Background: certain NFS operations (in this case, LINK, SYMLINK, and
MKDIR) are not idempotent, as they leave behind persisting state on the
server that prevents them from being replayed without an error;if an UDP
RPC reply is lost leading to a retransmission by theclient, the second
reply will return EEXIST rather than success, asthe new object has
already been created. The NFS client previouslysilently mapped the
EEXIST return into success to paper over thisproblem.

However, in all modern NFS server implementations, a reply cache is kept
in order to retransmit the original reply to a retransmitted request,
rather than performing the operation a second time, allowing this hack
to be avoided. This allows link()-based filelocking over NFS to operate
correctly, as an application requestingthe creation of a new link for a
file to tell if it succeededatomically or not.

Other NFS clients, including Solaris and Linux, generally follow this
behavior for the same reasons. Most clients also now default to TCP,
which also helps avoid the issue of retransmitted but non-idempotent
requests in most cases.

Reported by: Adam McDougall <mcdouga9 at egr dot msu dot edu>,
Timo Sirainen <tss at iki dot fi>
Reviewed by: mohans
MFC after: 1 week


173068 27-Oct-2007 rodrigc

Add the following mount options to the nfs_opts array:
noatime, noexec, suiddir, nosuid, nosymfollow, union,
noclusterr, noclusterw, multilabel, acls, force, update,
async. These options correspond to MOPT_STDOPTS, MOPT_FORCE, MOPT_UPDATE,
and MOPT_ASYNC.

Currently, mount_nfs converts these "-o" options from strings
to MNT_ flags via getmntopts(),
and passes the flags from userspace to the kernel.
This change will allow us in future to pass these mount options
as strings directly to the kernel via nmount() when doing NFS mounts.


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172759 18-Oct-2007 jhb

Add a -z flag to nfsstat which zeros the NFS statistics after displaying
them.

MFC after: 1 week
Requested by: ps
Submitted by: ps (6 years ago)


172697 16-Oct-2007 alfred

Get rid of qaddr_t.

Requested by: bde


172600 12-Oct-2007 mohans

NFS MP scaling changes.
- Eliminate the hideous nfs_sndlock that serialized NFS/TCP request senders
thru the sndlock.
- Institute a new nfs_connectlock that serializes NFS/TCP reconnects. Add
logic to wait for pending request senders to finish sending before
reconnecting. Dial down the sb_timeo for NFS/TCP sockets to 1 sec.
- Break out the nfs xid manipulation under a new nfs xid lock, rather than
over loading the nfs request lock for this purpose.
- Fix some of the locking in nfs_request.
Many thanks to Kris Kennaway for his help with this and for initiating the
MP scaling analysis and work. Kris also tested this patch thorougly.
Approved by: re@ (Ken Smith)


172324 25-Sep-2007 mohans

Fix for a very rare race, caused by the nfsiod wakeup and nfsiod idle
timeout occurring at exactly the same time. If this happens, the nfsiod
exits although there may be a queued async IO request for it.

Found by : Kris Kennaway
Approved by: re


171744 06-Aug-2007 rwatson

Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)


171190 03-Jul-2007 jhb

Fix for a race where out of order loading of NFS attrs into the
nfsnode could lead to attrs being stale. One example (that we
ran into) was a READDIR+, WRITE. The responses came back in
order, but the attrs from the WRITE were loaded before the
attrs from the READDIR+, leading to the wrong size from being
read on the next stat() call.

MFC after: 1 week
Submitted by: mohans
Approved by: re (kensmith)


171189 03-Jul-2007 jhb

Fix up NFS client write error handling. Errors are split into
recoverable and unrecoverable. For the former, we redirty the
buffer and hang onto it for future retries. For the latter (eg.
ESTALE), we discard the buffer and return the error back to the
user on the next syscall. This fixes a number of vfs panics and
fixes having a large number of dirty buffers (that cannot be
written out and reclaimed) from hanging around. Thanks to ups@
for discussions on this issue.

Reported by: kris, Kai, others
Approved by: re (kensmith)


170292 04-Jun-2007 attilio

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


170174 01-Jun-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


169681 18-May-2007 rwatson

In nfs_down(), if rep can be NULL, which we test for, then we should
lock and unlock conditionally, not just set the flag on it conditionally.
In practice, this bug couldn't manifest, as in the current revision of
the code, no callers pass a NULL rep.

CID: 1416
Found with: Coverity Prevent(tm)


169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


169043 25-Apr-2007 jhb

Various fixes to the NFS Directio support.
- Fix for a bug where a close would not wait for all (directio)
dirty buffers to drain. The nfsnode was not marked NMODIFIED
when there were directio dirtied buffers pending, causing this.
- No reason to vhold/vrele the vp when enqueueing DirectIO requests
for the nfsiods. The vnode can't really go way since the close
has to wait for these requests to drain.

MFC after: 1 week
Submitted by: mohans


168931 21-Apr-2007 rwatson

Attempt to rationalize NFS privileges:

- Replace PRIV_NFSD with PRIV_NFS_DAEMON, add PRIV_NFS_LOCKD.

- Use PRIV_NFS_DAEMON in the NFS server.

- In the NFS client, move the privilege check from nfslockdans(), which
occurs every time a write is performed on /dev/nfslock, and instead do it
in nfslock_open() just once. This allows us to avoid checking the saved
uid for root, and just use the effective on open. Use PRIV_NFS_LOCKD.


167830 23-Mar-2007 delphij

Don't destroy a mutex just before we use it, instead,
destroy it after we have used it.


167497 13-Mar-2007 tegge

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


167353 09-Mar-2007 mohans

Back out a chance to nfs_timer() that inadvertantly crept in the last checkin :(


167352 09-Mar-2007 mohans

Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.


167086 27-Feb-2007 jhb

Use pause() rather than tsleep() on stack variables and function pointers.


166782 16-Feb-2007 mohans

Backing out an earlier change. It seems harmless for NFS to miss the "force
unmount" flag, making the acquisition of the MNT_ILOCK in nfs_request() and
nfs_sigintr() unnecessary. Pointed out by tegge@.


166636 11-Feb-2007 mohans

Add missing MNT_ILOCK around some mnt_kern_flag accesses.


166378 31-Jan-2007 mohans

Fix for a vnode lock leak in nfs_create() in the event of an error.
Spotted by ups@.


166338 30-Jan-2007 kris

Instead of always hard-coding the socket type for the nfs root mount as
SOCK_DGRAM (i.e. UDP), respect the value configured earlier. This allows
TCP NFS root mounts using e.g. the boot.nfsroot.options="tcp" tunable.

In this case some of the connection parameters like the retry timer were
previously set appropriately for TCP but inappropriately for the UDP
socket that was actually used, leading to e.g. extremely long recovery
times (O(hours)) after a nfs server reboot.

Reviewed by: mohans
MFC After: 2 weeks


166218 25-Jan-2007 bde

Unstaticize nfs_iosize() in nfsclient and use it in nfs4client instead
of duplicating it except for larger style bugs in the copy.

Fix some nearby style bugs (including a harmless type mismatch)
in and near the remaining copy.

This is part of fixing collisions of the 2 nfs*client's names. Even
static names should have a unique prefixes so that they can be debugged
easily.


166193 23-Jan-2007 kib

Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)


165104 11-Dec-2006 mohans

NetApp filers return corrupt post op attrs in the wcc on NFS error responses.
This is easy to reproduce for EROFS. I am not sure if the attrs can be corrupt
for other NFS error responses. For now, disabling wcc pre-op attr checks and
post-op attr loads on NFS errors (sysctl'ed).
Reported by: Kris Kennaway


164934 06-Dec-2006 sam

consolidate parsing of nfs root mount options in one place
and handle all options (some may require fixes elsewhere)

Reviewed by: jhb, mohans
MFC after: 1 month


164735 29-Nov-2006 mohans

In nfs_nget(), we must initialize the fh in the nfsnode before inserting the
vnode into the vfs hash. Otherwise, another thread walking the hash can trip
on an nfsnode with an uninitialized or partially initialized fh.
Thanks to ups@ for spotting this race.


164701 27-Nov-2006 mohans

bde@ pointed out that tprintf() acquires Giant so callers of tprintf() don't
have to explicitly acquire Giant (although they need to be aware of this and
not hold any locks at that point). Remove the acquisitions of Giant in the
NFS client wrapping tprintf().


164684 27-Nov-2006 mohans

Fix for a bug caused by a race when 2 threads lookup the same
file. Leave the loser's lock(s) initialized, so the reclaim logic can
unconditionally destroy them when that race occurs (or if the vfs hash
insert happened to fail for some other reason). Thanks to ups@ for a
careful review of the code.
Reported by : Kris Kennaway


164430 20-Nov-2006 mohans

1) Fix up locking in nfs_up() and nfs_down.
2) Reduce the acquisitions of the Giant lock in the nfs_socket.c paths significantly.
- We don't need to acquire Giant before tsleeping on lbolt anymore,
since jhb specialcased lbolt handling in msleep.
- nfs_up() needs to acquire Giant only if printing the "server up"
message.
- nfs_timer() held Giant for the duration of the NFS timer processing,
just because the printing of the message in nfs_down() needed it
(and we acquire other locks in nfs_timer()). The acquisition of
Giant is moved down into nfs_down() now, reducing the time Giant is
held in that path.

Reported by: Kris Kennaway


164346 16-Nov-2006 mohans

vfs_hash_insert() vputs() the losing vnode before returning, in the event of
a race where a duplicate vnode is entered into the vfs hash. nfs_nget() shouldn't
be releasing the vnode in that case.


164345 16-Nov-2006 mohans

Fix to readdir+ reply handling. When inserting an entry into the namecache,
initialize the nfsnode's ctime. Otherwise a subsequent lookup purges the
just entered namecache entry.


164063 07-Nov-2006 sam

honor nolockd flag in root mount options

MFC after: 2 weeks


163830 31-Oct-2006 mohans

Make EWOULDBLOCK a recoverable error so that the request is retransmitted.
This bug results in data corruption with NFS/TCP. Writes are silently dropped
on EWOULDBLOCK (because socket send buffer is full and sockbuf timer fires).

Reviewed by: ups@


163471 17-Oct-2006 bde

Fixed some style bugs (especially ones involving long lines and use
of __P(())). There are many more.


163341 14-Oct-2006 bde

Don't do null Setattr RPCs for VA_MARK_ATIME. When we added the
VA_MARK_ATIME feature to fix POSIX conformance fore execve() and mmap(),
we thought that it was optimized well enough for the one file system
that supports it (ffs) and harmless for other file systems (except
layered ones which already get the layering for VOP_SETATTR() wrong).
However, nfs_setattr() doesn't do much parameter checking, so when
it gets a combination of parameters that it doesn't understand, it
always does a Setattr RPC. This RPC can't do anything good, and for
VA_MARK_ATIME it is null except for wasting a lot of time.

This is the smallest and easiest to fix of several bugs that have
increased the number of RPCs for kernel builds on nfs by more than
100% since 2004-11-05. The real-time increase depends on network
latency and parallelization and can also be very large (approaching
the same percentage for unparallelized operations like "make depend"
on systems with fast CPUs and high-latency networks).


162954 02-Oct-2006 phk

First part of a little cleanup in the calendar/timezone/RTC handling.

Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.


162649 26-Sep-2006 tegge

Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.


162647 26-Sep-2006 tegge

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


162288 13-Sep-2006 mohans

Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.

- amd on 6.x is a non-starter (without this change). Using amd under
heavy load results in a deadlock (with cascading vnode locks all the
way to the root) very quickly.
- This change should also fix the more general problem of cascading
vnode deadlocks when an NFS server goes down.

Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example isn't
ready for shared vnode lock lookups, crashing pretty quickly.

This change is the result of discussions with Stephan Uphoff (ups@).

Reviewed by: ups@


161722 29-Aug-2006 mohans

Fix for a deadlock triggered by a 'umount -f' causing a NFS request to never
retransmit (or return). Thanks to John Baldwin for helping nail this one.

Found by : Kris Kennaway


161371 16-Aug-2006 thomas

Fix typos in comment.


161125 09-Aug-2006 alc

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


161109 09-Aug-2006 brooks

Add a new kernel environment variable "boot.netif.mtu" which is used to
set the MTU prior to mounting root via NFS. This is required if the
server supports a higher than default MTU because the client will not
see the responses otherwise.

MFC after: 3 weeks


160619 24-Jul-2006 rwatson

soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman


160182 08-Jul-2006 kib

Signals may be delivered to process as well as to the thread. Check the
thread-delivered signals in addition to the process one.

Reviewed by: mohan
MFC after: 1 month
Approved by: kan (mentor)


160181 08-Jul-2006 kib

Always supply curthread as argument to nfs_asyncio and nfs_doio
in nfs_strategy. Otherwise, for some buffers, signals would be ignored
at the intr mounts.

Reviewed by: mohan
MFC after: 1 month
Approved by: kan (mentor)


160038 29-Jun-2006 yar

There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures. Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by: glebius, jhb, ru


160029 29-Jun-2006 yar

Use the elegant TAILQ_FOREACH() in place of a hand-rolled for() loop.


159081 30-May-2006 mohans

Kris Kennaway found that for '/' NFS mounts, the MPSAFE mount flag was
not being set, which means Giant would be acquired for these mounts.


158965 26-May-2006 mohans

Fix for a potential attempt to sleep while holding nm_mtx. Caught and reported
by Witness (which forces the mbuf allocation flag to M_NOWAIT).

Reported by: "sekes".


158915 25-May-2006 ups

Call vm_object_page_clean() with the object lock held.

Submitted by: kensmith@
Reviewed by: mohans@
MFC after: 6 days


158906 25-May-2006 ups

Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days


158905 24-May-2006 mohans

Since NFSv4 is not SMP safe, nfsiod needs to acquire Giant for NFSv4 mounts
before doing the read/write.

Reported by: Chuck Lever.


158903 24-May-2006 rwatson

Adjust minimum iod threads from 4 to 0 -- since we compile the NFS
client into the kernel by default, and many users won't use NFS,
don't start an extra 4 kernel threads that are unused. Once NFS
becomes active, it will start nfsiod's as it needs them.

We might consider mandating a minimum iod's equal to the number of
active NFS mounts (truncated to some value), which would force some
to remain available without having to create a new one if the file
system is mostly inactive.

PR: 70880
MFC after: 2 weeks
Prodded by: cel
Head nod: peter
Pointed out by: Joe <fbsd_user at a1poweruser dot com>


158860 23-May-2006 cel

NFS over TCP retransmit behavior should default to a 60 second time out,
mimicing the NFS reference implementation.

NFS over TCP does not need fast retransmit timeouts, since network loss
and congestion are managed by the transport (TCP), unlike with NFS over
UDP. A long timeout prevents the unnecessary retransmission of non-
idempotent NFS requests.

Reviewed by: mohans, silby, rees?
Sponsored by: Network Appliance, Incorporated


158859 23-May-2006 cel

Refactor the NFS over UDP retransmit timeout estimation logic to allow
the estimator to be more easily tuned and maintained.

There should be no functional change except there is now a lower limit
on the retransmit timeout to prevent the client from retransmitting
faster than the server's disks can fill requests, and an upper limit
to prevent the estimator from taking to long to retransmit during a
server outage.

Reviewed by: mohan, kris, silby
Sponsored by: Network Appliance, Incorporated


158855 23-May-2006 mohans

Vnode locks are recursive and the NFS client support shared vnode locks.

Found by: Kris Kennaway.


158739 19-May-2006 mohans

Changes to make the NFS client MP safe.

Thanks to Kris Kennaway for testing and sending lots of bugs my way.


158316 05-May-2006 mohans

Fix a snafu caused while patching the previous fix from another branch.


158315 05-May-2006 mohans

Fix for a NFS/TCP client bug which would cause the NFS/TCP stream to get
out of sync under heavy loads, forcing frequent reconnets, causing EBADRPC
errors etc.


157557 06-Apr-2006 mohans

Keep track of the number of in-progress async direct IO writes in the nfsnode.
Make fsync/close wait until all of these drain. Add a check to nfs_getpage() and
nfs_putpage().


157349 01-Apr-2006 jeff

- Busy the filesystem in nfs_statfs to prevent us from creating a new
vnode after vflush() has succeeded. This would cause a dangling vnode
panic at unmount time otherwise. Other filesystems may have this problem
via their VFS_VGET() routines.

Found by: kris
Sponsored by: Isilon Systems, Inc.


157058 23-Mar-2006 kris

Fix a bug in the NFS/TCP retransmission path.

The bug was that earlier, if a request was retransmitted,
we would do subsequent retransmits every 10 msecs.

This can cause data corruption under moderate loads by reordering
operations as seen by the client NFS attribute cache, and on the
server side when the retransmission occurs after the original request
has left the duplicate cache, since the operation will be committed
for a second time.

Further work on retransmission handling is needed (e.g. they are still
being done sent too often since they are scaled by HZ, and the size of
the dup cache is too small and easily overwhelmed on busy servers).

Submitted by: mohans


156879 19-Mar-2006 pjd

Actually I wanted 'nolockd' here instead of 'lockd'.

MFC after: 2 days


156825 17-Mar-2006 cel

If an NFS server returns more than a few EJUKEBOX errors for a given RPC
request, the FreeBSD NFS client will quickly back off to a excessively
long wait (days, then weeks) before retrying the request.

Change the behavior of the FreeBSD NFS client to match the behavior of
the reference NFS client implementation (Solaris). This provides a fixed
delay of 10 seconds between each retry by default. A sysctl, called
nfs3_jukebox_delay, is now available to tune the delay. Unlike Solaris,
the sysctl value on FreeBSD is in seconds, rather than in HZ.

Sponsored by: Network Appliance, Incorporated
Reviewed by: rick
Approved by: silby
MFC after: 3 days


156416 08-Mar-2006 cel

Fix a bug in NFSv3 READDIRPLUS reply processing

The client's READDIRPLUS logic skips the attributes and
filehandle of the ".." entry. If the server doesn't send
attributes but does send a filehandle for "..", the
client's logic doesn't account for the extra "value
follows" field that indicates whether the filehandle is
present, causing the remaining entries in the reply
to be ignored.

Sponsored by: Network Appliance, Inc.
Reviewed by: rick, mohans
Approved by: silby
MFC after: 2 weeks


154580 20-Jan-2006 rees

Don't log an error on tcp connection reset, even if we don't get ECONNRESET.

Submitted by: cel@citi.umich.edu


154487 17-Jan-2006 alfred

I ran into an nfs client panic a couple of times in a row over the
last few days. I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter. The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.

The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any). I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.

Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week


154316 13-Jan-2006 rwatson

In nfs_dolock(), GC now under-used ioflg, rendered obsolete when we moved
from using a fifo to talk to rpc.lockd to using a special device node.

Noticed by: Coverity Prevent analysis tool
MFC after: 3 days


154152 09-Jan-2006 tegge

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


153786 28-Dec-2005 delphij

Correct a typo


153365 12-Dec-2005 ps

Improve upon rev 1.133 where NFS/TCP would not reconnect.

Submitted by: Mohan Srinivasan


152920 29-Nov-2005 ru

Unexpand LLADDR().


152657 21-Nov-2005 ps

Fix for a bug where NFS/TCP would not reconnect (in the case where
the server FIN'ed). Seen with Solaris NFS servers.

Reported by: TOMITA Yoshinori <yoshint@flab.fujitsu.co.jp>
Submitted by: Mohan Strinivasan


152656 21-Nov-2005 ps

- Always return success from NFS strategy. nfs_doio(), in the
event of an error, does the right thing, in terms of setting
the error flags in the buf header. That fixes a crash from
bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
to be re-dirtied. ETIMEDOUT can occur on soft mounts, when
the number of retries are exceeded, and we don't want data loss
in that case.

Submitted by: Mohan Srinivasan


152652 21-Nov-2005 rees

fix a problem with XID re-use when a server returns NFSERR_JUKEBOX.

Submitted by: cel@citi.umich.edu
Fixed by: rick@snowhite.cis.uoguelph.ca
Approved by: alfred
MFC after: 3 weeks


152289 10-Nov-2005 jon

fix a crash when an nfsv2 mount fails

MFC after: 1 week


152019 03-Nov-2005 ps

Fix for a crash (from nfs_lookup() in an error case).

Submitted by: Mohan Srinivasan


152000 03-Nov-2005 ps

In nfs_flush(), clear the NMODIFIED bit only if there are no dirty
buffers *and* there are no buffers queued up for writing. The bug
was that NMODIFIED was being cleared even while there were buffers
scheduled to be written out, which leads to all sorts of interesting
bugs - one where the file could shrink (because of a post-op getattr
load, say) causing data in buffer(s) queued for write to be tossed,
resulting in data corruption.

Submitted by: Mohan Srinivasan


151998 03-Nov-2005 ps

Fix for a race between the thread transmitting the request and the
thread processing the reply.

Submitted by: Mohan Srinivasan


151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


151695 26-Oct-2005 glebius

- Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
difficult to understand.

Reported by: Vladimir Sharun <sharun ukr.net>


151024 06-Oct-2005 pjd

- Use strsep() instead of strtok().
- strdup() uses M_WAITOK, so we don't need to check it's return value
against NULL.

MFC after: 2 weeks


150995 06-Oct-2005 pjd

Add boot.nfsroot.options loader tunable.
It allows to specify options for NFS root file system.
Currently supported options are: soft, intr, conn, lockd.

I'm adding this functionality mostly for 'lockd' option, which is only
honored when performing the initial mount and will be silently ignored
if used while updating the mount options.

This will allow to use flock(2) without the need of using varmfs or
rpc.lockd and friends.

Example of use:
boot.nfsroot.options="intr,lockd"

MFC after: 2 weeks


150335 19-Sep-2005 rwatson

Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions. In most cases, this means adding an
acquisition of Giant immediately around the function. In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset. This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after: 1 week


148448 27-Jul-2005 ps

FIx for a bug in the change that made nfs_timer() MPSAFE. We need to
grab Giant before calling pru_send() (if running with mpsafenet = 0).

Found by: Jeremie Le Hen.
Fixed by: Maxime Henrion


148447 27-Jul-2005 ps

In nfs_nget() if two threads race on the same filehandle, the loser should
cause the nfsnode to get freed. This fixes a potential vnode (and nfsnode)
leak in that path.

Submitted by: Mohan Srinivasan
Reviewed by: phk


148268 21-Jul-2005 ps

Remove the NFS client rslock. The rslock was used to serialize
writers that want to extend the file. It was also used to serialize
readers that might want to read the last block of the file (with a
writer extending the file). Now that we support vnode locking for
NFS, the rslock is unnecessary. Writers grab the exclusive vnode
lock before writing and readers grab the shared (or in some cases
the exclusive) lock.

Submitted by: Mohan Srinivasan


148162 19-Jul-2005 ps

Make nfs_timer() MPSAFE. With this change, the bottom half of the NFS
client (the interface with the protocol stack and callouts) is
Giant-free.

Submitted by: Mohan Srinivasan.


148111 18-Jul-2005 ps

Fix for a NFS soft mounts bug where if the number of retries exceeds
the max rexmits, the request was not being bounced back with a
ETIMEDOUT error.

Reported by: Oliver Lehmann
Submitted by: Mohan Srinivasan


148008 14-Jul-2005 ps

Fixes for NFS crashes on architectures that require strict alignment.
- Fix nfsm_disct() so that after pulling up data, the remaining data
is aligned if necessary.
- Fix nfs_clnt_tcp_soupcall() to bcopy() the rpc length out of the
mbuf (instead of casting m_data to a uint32).

Submitted by: Pyun YongHyeon
Reviewed by: Mohan Srinivasan


147420 16-Jun-2005 green

Ifdef out the incomplete non-blocking IO implementation for NFS
pending discussion of how implementation would proceed. Applications
like -lc_r expect select(3) to match the EAGAIN-status of IO
functions.

Approved by: re


147280 10-Jun-2005 green

Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries. However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction. This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously. If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system. It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by: gad
MFC after: 2 weeks
More testing: wes
PR: kern/79208


146333 17-May-2005 des

Ugh. Previous commit got the logic exactly backward.

Submitted by: bland
Pointy hat to: des


146316 17-May-2005 des

Revision 1.173 broke updating a mount from ro to rw. Fix that by clearing
the MNT_RDONLY flag if MNT_UPDATE is set and "ro" was not specified.

Suggested by: cognet


146065 10-May-2005 rees

set R_MUSTRESEND flag in mark_for_reconnect so re-connected requests get
re-sent instead of timing out.

don't log an error message on reconnection, which is not an error.

remove unused nfs_mrep_before_tsleep.

Reviewed by: Mohan Srinivasan
Approved by: alfred


145879 04-May-2005 ps

Fix a bug in NFS/TCP where retransmissions would not reliably happen
if the server rebooted or tore down the connection for any reason.

Found by: Jonathan Noack.
Submitted by: Mohan Srinivasan.


145806 02-May-2005 iedowse

Don't copy the NFSMNT_* flags into struct statfs's f_flags field,
as they have no connection with the expected MNT_* flags. This bug
was exposed 18 months ago when the assignments to f_flags in
vfs_syscalls.c were moved to before the VFS_STATFS() call. It was
fixed in the CSRG source 10 years ago, but we never picked up that
change.

PR: kern/80390
MFC after: 1 week


145598 27-Apr-2005 des

When NFS was converted to the new mount syscall, code was written that sets
the MNT_RDONLY flag if the "ro" option was passed in from userland, and
clears it otherwise. In the diskless case, the MNT_RDONLY flag is already
set when this code is reached, but there are no mount options, so it was
incorrectly cleared. Change the logic so the MNT_RDONLY flag is set if the
"ro" option was specified, and left alone otherwise.

Note that the NFS code will still happily let you mount a filesystem RW
even if the server exports it RO. I'm not sure how to fix that.


145572 26-Apr-2005 des

While I'm here, list the new kenv (boot.netif.name) along with the others.


145570 26-Apr-2005 des

When netbooting, as soon as we've figured out which interface we booted
from, store its name in a kenv variable.


145235 18-Apr-2005 rees

TCP reconnect is not an error.
Change the message from LOG_ERR to LOG_INFO.

Approved by: alfred


145060 14-Apr-2005 jeff

- cache_lookup() relocks the parent in the DOTDOT case for us.

Spotted by: phk
Sponsored by: Isilon Systems, Inc.


145006 13-Apr-2005 jeff

- Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case. Se vfs_lookup.c r1.79 for details.

Sponsored by: Isilon Systems, Inc.


144367 31-Mar-2005 jeff

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


144299 29-Mar-2005 jeff

- Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
LOCKPARENT or WANTPARENT.


144297 29-Mar-2005 jeff

- cache_lookup() now locks the new vnode for us to prevent some races.
Remove redundant code.

Sponsored by: Isilon Systems, Inc.


144206 28-Mar-2005 jeff

- We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.
- Network filesystems are written with a special idiom that checks the
cache first, and may even unlock dvp before discovering that a network
round-trip is required to resolve the name. I believe dvp is prevented
from being recycled even in the forced unmount case by the shared lock
on the mount point. If not, this code should grow checks for VI_DOOMED
after it relocks dvp or it will access NULL v_data fields.

Sponsored by: Isilon Systems, Inc.


144059 24-Mar-2005 jeff

- Update vfs_root implementations to match the new prototype. None of
these filesystems will support shared locks until they are explicitly
modified to do so. Careful review must be done to ensure that this
is safe for each individual filesystem.

Sponsored by: Isilon Systems, Inc.


144040 23-Mar-2005 ps

- The NFS client was incorrectly masking SIGSTOP (which is
non-maskable).
- The NFS client needs to guard against spurious wakeups
while waiting for the response. ltrace causes the process
under question to wakeup (possibly from ptrace()), which
causes NFS to wakeup from tsleep without the response being
delivered.

Submitted by: Mohan Srinivasan


143822 18-Mar-2005 das

Don't brelse(bp) if bp is null. Also, eliminate some redundancy
and dead code.

Found by: Coverity Prevent analysis tool


143693 16-Mar-2005 phk

Use vfs_hash.


143687 16-Mar-2005 jmg

MFp4: use the function to fix the packet header length instead of rolling
our own...


143511 13-Mar-2005 jeff

- VOP_INACTIVE should no longer drop the vnode lock.

Sponsored by: Isilon Systems, Inc.


143510 13-Mar-2005 jeff

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


143508 13-Mar-2005 jeff

- It is no longer necessary to lock and unlock the vnode in nfs_close() as
the top level does this for us now.

Sponsored by: Isilon Systems, Inc.


142568 26-Feb-2005 ps

Minor cleanup in nfs_request() and removal of a comment that doesn't
reflect reality.

Submitted by: Mohan Srinivasan


142233 22-Feb-2005 phk

vp->v_id is a private field for the vfs namecache and it is a big mistake
that NFS ever started using it. Long time ago I added the necessary
vhold()/vdrop() calls to replace it, but forgot to remove the v_id code.

Do it now.


142079 19-Feb-2005 phk

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


142070 18-Feb-2005 ps

Fix for a potential NFS client race where shared data is updated from
base context as well as the socket callback.

Submitted by: Mohan Srinivasan


141465 07-Feb-2005 jhb

Drop Giant before calling kthread_exit().


140996 29-Jan-2005 rwatson

Style cleanup for O_DIRECT sysctl comment introduced in nfs_vnops.c:1.242.


140939 28-Jan-2005 phk

Make filesystems get rid of their own vnodes vnode_pager object in
VOP_RECLAIM().


140777 24-Jan-2005 phk

Create a vnode_pager object when a file is opened.


140731 24-Jan-2005 phk

Remove unused cred arg from nfs_vinvalbuf() and many bogus arguments
passed for it.


140460 18-Jan-2005 peter

Mostly back out rev 1.33 from quite some time ago, and the followup fixes
and tweaks. The code was actually quite broken because it discarded the
upper bits of the 64 bit division. We only had a 50% chance of scaling up
the blocksize for large NFS client mounts when it was needed. For 5.x and
beyond, this was harmless because we could represent the result in either
case. For 4.x this was a big problem though. (4.x also has a df(1) bug to
compound the problem)


140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


140122 12-Jan-2005 brian

Include opt_bootp.h for BOOTP_NFSROOT

PR: 73183
Submitted by: Darrin Smith sdar at salseast dot org
MFC after: 7 days


140056 11-Jan-2005 phk

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


139823 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


139744 05-Jan-2005 ps

If the NFS/TCP stream is out of sync between the client and server,
and if the client (erroneously) reads the RPC length as 0 bytes, the
client can loop around in the socket callback. Explicitly check for
the length being 0 case and teardown/re-connect.

Submitted by: Mohan Srinivasan


139247 23-Dec-2004 ps

Turn NFS directio off until the stability issues are resolved.


138919 16-Dec-2004 ps

Change the NFS sillyrename convention so that we won't run out
of sillyrenames (which were limited to 58 per pid per directory,
for no good reason). The new format of sillyrenames looks like

.nfs.0000b31a.00d24.4
^^^^^^^^ ^^^^^
ticks pid

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


138899 15-Dec-2004 ps

First cut of NFS direct IO support.
- NFS direct IO completely bypasses the buffer and page caches.
If a file is open for direct IO all caching is disabled.
- Direct IO for Directories will be addressed later.
- 2 new NFS directio related sysctls are added. One is a knob to
disable NFS direct IO completely (direct IO is enabled by default).
The other is to disallow mmaped IO on a file that has at least one
O_DIRECT open (see the comment in nfs_vnops.c for more details).
The default is to allow mmaps on a file that has O_DIRECT opens.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


138694 11-Dec-2004 marcel

Revert rev 1.233. The null-pointer function call (a dereference on
ia64) was not the result of a change in the vector operations. It
was caused by the NFS locking code using a FIFO and those bypassing
the vnode. This indirectly caused the panic. The NFS locking code has
been changed.

Requested by: phk


138645 10-Dec-2004 ps

In nfs_rename(), skip the otw rename operation if the fsync (to
either src or dst) fails. This closes a potential data loss case
(where the fsync failed with ENOSPC, for example).

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


138644 10-Dec-2004 ps

Store a hint in the nfsnode to detect sequential access of the file.
Kick off a readahead only when sequential access is detected. This
eliminates wasteful readaheads in random file access.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


138529 07-Dec-2004 ps

Fix for a Lock Order Reversal in the nfs_flush() path, between the
vnode interlock and the proc lock.

Reported by: marcel
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138514 07-Dec-2004 phk

Don't clobber mnt_stat.f_mntonname


138509 07-Dec-2004 phk

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


138505 07-Dec-2004 ps

Always issue wakeups() to the NFS requestors under the mutex
to close all potential cases of missed wakeups.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138496 06-Dec-2004 ps

Rewrite of the NFS client's reply handling. We now have NFS socket
upcalls which do RPC header parsing and match up the reply with the
request. NFS calls now sleep on the nfsreq structure. This enables
us to eliminate the NFS recvlock.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138473 06-Dec-2004 ps

2 fixes that improve on the consistency of the NFS client cache.
- Change the cached mtime to a 'struct timespec' from a
time_t. Improving the precision of the cached mtime tightens up
NFS' "close-to-open" consistency considerably.
- Always force an over-the-wire consistency check from nfs_open()
(unless the file is marked modified). This further improves
NFS' "close-to-open" consistency.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138469 06-Dec-2004 ps

Serialize NFS vinvalbuf operations by acquiring/upgrading to the
vnode EXCLUSIVE lock. This prevents threads from adding pages to
the vnode while an invalidation is in progress, closing potential
races. In the bioread() path, callers acquire the SHARED vnode lock
- so while an invalidate was in progress, it was possible to fault
in new pages onto the vnode causing the invalidation to take a while
or fail. We saw these races at Yahoo! with very large files+heavy
concurrent access. Forcing an upgrade to EXCLUSIVE lock before doing
the invalidation closes all these races.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138463 06-Dec-2004 ps

Add non-blocking versions of nfsm_dissect() and friends, for use from
socket callbacks or similar callers, from both the NFS client and the
server.
Instituted nfsm_dissect_nonblock(), nfsm_dissect_xx_nonblock(). And
nfsm_disct() now takes an extra M_TRYWAIT/M_DONTWAIT argument.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138460 06-Dec-2004 ps

- If all data has been committed to stable storage on the server, it
is safe to turn off the nfsnode's NMODIFIED flag.
- Move the check for signals to the top of the loop where we loop
around the dirty buffers on the vnode, scheduling writes. This
ensures that we'll break ouf of the flush operation on reception of
a signal.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138458 06-Dec-2004 rwatson

Correct a typo in a comment.


138430 06-Dec-2004 phk

For reasons unknown, the nfs locking code used a fifo to send requests to
userland and a dedicated system call to get replies.

The vnode-bypass of fifos broke this into a panic.

Ditch all the magic and create a device /dev/nfslock instead, and
use that for both directions apart from the shorter path, this is
also faster because the device driver runs Giant free using the
vnode bypass.

Noticed by: marcel


138419 05-Dec-2004 rwatson

Convert GIANT_REQUIRED; in nfs_mountroot() to NET_ASSERT_GIANT(),
and annotate that nfs_mountroot assumes it is OK to step on the
values in the global NFSv3 diskless structure as the mountroot
function is called during a serialized part of the boot, before
any other NFS client activity occurs.

MFC after: 2 weeks


138418 05-Dec-2004 rwatson

Convert a GIANT_REQUIRED; into a NET_ASSERT_GIANT();, as sockets are
now only conditionally protected by Giant based on debug.mpsafenet.


138412 05-Dec-2004 phk

VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but a few cases
doesn't. Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.

Fix this the cleaner way: Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.


138411 05-Dec-2004 marcel

Fix null-pointer indirect function calls introduced in the previous
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
ed explicitly. A more generic or scalable solution is likely to see
the light of day.

No pathos on: current@


138290 01-Dec-2004 phk

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


138278 01-Dec-2004 phk

Remove redundant functions (repo-copied from nfsclient) for dealing with
fifos.


138276 01-Dec-2004 phk

Scripted modification of vop_* prototypes to use typedefs.


138259 01-Dec-2004 phk

Add missing #include


138256 01-Dec-2004 ps

Fix for a race between lookup and readdirplus, that causes
a deadlock (with NFS exclusive vnode locks enabled). Lookup
grabs the parent's lock and wants to lock child. Readdirplus
locks the child and wants to lock parent (for loading the attrs
for ".."). The fix is to not load the attrs for ".." in
readdirplus.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


138255 01-Dec-2004 ps

Clean all dirty pages (dirtied by mmap'ed writes) in nfs_close().
This closes a major hole in close-to-open consistency support.
Added a new sysctl so that this can be disabled for single NFS
client applications with very large amounts of mmap'ed IO (for
performance).

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


138254 01-Dec-2004 ps

Fix for a (blocks) underrun bug where negative values were being
returned back to df from a statfs call. Causing df to print negative
values.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


138204 29-Nov-2004 ps

Fix for a bug in nfs_mkdir() that called vrele() instead of vput()
in the error cases, causing panics.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson


137846 18-Nov-2004 jeff

- Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf. This is done to prevent lock order
reversals with code that must call bremfree() with a local lock held.
This also reduces overhead by removing two lock operations per buf for
fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a
free-list for use only by NFS. This is done because the nfsclient code
overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement
that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
Remove a panic associated with this condition and delay asserts that
inspect the buf until after it is locked.

Reviewed by: phk
Sponsored by: Isilon Systems, Inc.


137480 09-Nov-2004 phk

Detect root mount attempts on the flag, not on the NULL path.


137197 04-Nov-2004 phk

Retire b_magic now, we have the bufobj containing the same hint.


136927 24-Oct-2004 phk

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


136751 21-Oct-2004 phk

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


136517 14-Oct-2004 pjd

Add a missing newline character.


136006 01-Oct-2004 das

nfsclient/nfs_bio.c has a PHOLD() without a PRELE(). Neither should
be necessary here. Also, use killproc() instead of psignal().


135874 28-Sep-2004 phk

Remove support for using NFS device nodes.


135862 27-Sep-2004 phk

Remove NFS4 vop method vector for devices: we are desupporing device nodes
on anything but DEVFS and in this case it was not even used (see below).

Put the NFS4 vop method for fifo's behind "#if 0" because it is unused.
Add a XXX comment to say that I think the unusedness is a bug.


135860 27-Sep-2004 phk

style consistency.


135280 15-Sep-2004 phk

Remove unused B_WRITEINPROG flag


134898 07-Sep-2004 phk

Explicitly pass vnode to nfs_doio() and mountpoint to nfs_asyncio().


134281 25-Aug-2004 rwatson

In nfs_timer(), pass curthread rather than &thread0 into the protocol
send routine. In IPv6 UDP, the thread will be passed to suser(), which
asserts that if a thread is used for a super user check, it be
curthread. Many of these protocol entry points probably need to
accept credentials instead of threads.

MT5 candidate.

Noticed/tested by: kuriyama


132902 30-Jul-2004 phk

Put a version element in the VFS filesystem configuration structure
and refuse initializing filesystems with a wrong version. This will
aid maintenance activites on the 5-stable branch.

s/vfs_mount/vfs_omount/

s/vfs_nmount/vfs_mount/

Name our filesystems mount function consistently.

Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.

Reorganize the root filesystem selection code.


132808 28-Jul-2004 phk

Move a relic to its correct location(s): Put nfs diskless initialization
calls with the code they call. (Yet another example of mindless copy&paste).


132805 28-Jul-2004 phk

Remove global variable rootdevs and rootvp, they are unused as such.

Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.


132640 25-Jul-2004 phk

Eliminate unused second argument to reassignbuf() and simplify it
accordingly.


132084 13-Jul-2004 alfred

Turn off SO_REUSEADDR and SO_REUSEPORT, they were causing EADDRINUSE
to be returned from the protocol stack.

Pointy hat to me for not groking what those options _really_ mean.


132060 12-Jul-2004 dwmalone

Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by: alfred


132023 12-Jul-2004 alfred

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


132018 12-Jul-2004 alfred

Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama


131840 08-Jul-2004 brian

Change the following environment variables to kernel options:

bootp -> BOOTP
bootp.nfsroot -> BOOTP_NFSROOT
bootp.nfsv3 -> BOOTP_NFSV3
bootp.compat -> BOOTP_COMPAT
bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit. It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone


131814 08-Jul-2004 brian

Change the following kernel options to environment variables:

BOOTP -> bootp
BOOTP_NFSROOT -> bootp.nfsroot
BOOTP_NFSV3 -> bootp.nfsv3
BOOTP_COMPAT -> bootp.compat
BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

bootp="YES"
bootp.nfsroot="YES"
bootp.nfsv3="YES"
bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.


131717 06-Jul-2004 rwatson

Acquire socket lock in nfs_connect() connection/sleep loop to protect
socket state and avoid missed wakeups.


131697 06-Jul-2004 alfred

use vfs_suser() to restrict access to the nfs mount's timeout.


131694 06-Jul-2004 alfred

NFS mobility Phase VI:

Export NFS mount state via sysctl.
Export timeout via sysctl.


131691 06-Jul-2004 alfred

NFS mobility PHASE I, II & III (phase VI, and V pending):

Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.


131551 04-Jul-2004 phk

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.


131021 24-Jun-2004 rwatson

When updating sb_flags, acquire the socket buffer lock to prevent
races.


130640 17-Jun-2004 phk

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


130619 17-Jun-2004 rwatson

Remove bad cookie vp kernel printf; while it does notify about an
interesting event, there's little or nothing the user can do about
it.


130554 16-Jun-2004 rwatson

Convert GIANT_REQUIRED to NET_ASSERT_GIANT where Giant is used to
protect socket operations. Leave one "as-is" as it also frobs
rootvp.


128992 06-May-2004 alc

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


128263 14-Apr-2004 peadar

Let the NFS client notice a file's size changing as a modification.
This avoids presenting invalid data to the client's applications
when the file is modified, and then extended within the window of
the resolution of the modifcation timestamp.

Reviewed By: iedowse
PR: kern/64091


128126 11-Apr-2004 marcel

Unbreak build: s/TAILQ_ISEMPTY/TAILQ_EMPTY/g


128111 11-Apr-2004 peadar

Clean up properly when unloading NFS client module.

This includes a modified form of some code from Thomas Moestl (tmm@)
to properly clean up the UMA zone and the "nfsnodehashtbl" hash
table.

Reviewed By: iedowse
PR: 16299


127977 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


127857 04-Apr-2004 rwatson

Spell 2 as SHUT_RDWR when used as an argument to soshutdown().


127797 03-Apr-2004 peadar

Flush cached access mode after modifying a files attributes for
NFSv3. It's likely that modifying the attributes will affect the
file's accessibility. This version of the patch is one suggested
by Ian Dowse after reviewing my original attempt in the PR

Reviewed By: iedowse
PR: kern/44336
MFC after: 3 days


127515 28-Mar-2004 kan

Reset callout if in nfs_timeout and rpcclnt_timeout functions. Timer
are supposed to continue firing as long as there is work to do, not
stop after the first invocation.

This is damage control after a patch that has been committed prematurely.

Tested by: kris


127421 25-Mar-2004 rees

only do nfs rpc callouts if there is work to do.

Submitted by: kan
Approved by: alfred


127144 17-Mar-2004 pjd

Add a comment with an explanation why we don't report EPIPE errors on
nfs sockets.

Requested by: ru


127137 17-Mar-2004 pjd

Don't report EPIPE errors on nfs sockets. These can be due to idle tcp
mounts which will be closed by netapp, solaris, etc. if left idle too long.

Obtained from: NetBSD


126962 14-Mar-2004 peter

Calculate NFS timeouts in units of 10ms, not 5ms. This matches the default
clock precision on i386. This is a NOP change on i386. But this stops
the mount_nfs units from suddenly changing to units of 1/20 of a second
(vs the normal 1/10 of a second) if HZ is increased.


126888 12-Mar-2004 brooks

Allow kernel with the BOOTP option to boot when DHCP/BOOTP sets the root
path to an absolute path without a host name. Previously, there was a
nasty POLA violation where a system would PXE boot until you added the
BOOTP option and then it would panic instead.

Reviewed by: tegge, Dirk-Willem van Gulik <dirkx at webweaving.org>
(a previous version)
Submitted by: tegge (getip function)


126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


126851 11-Mar-2004 phk

Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.


126425 01-Mar-2004 rwatson

Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by: sam


126330 27-Feb-2004 rees

NFSv4 fixes from Connectathon 2004:

remove unused pid field of file context struct
map nfs4 error codes to errnos
eliminate redundant code from nfs4_request
use zero stateid on setattr that doesn't set file size
use same clientid on all mounts until reboot
invalidate dirty bufs in nfs4_close, to play it safe
open file for writing if truncating and it's not already open

Approved by: alfred


126105 22-Feb-2004 cperciva

If mountnfs returns an error, it will have already freed nam; no need to
free it again.

Reported by: "Ted Unangst" <tedu@coverity.com>
Approved by: rwatson (mentor)


125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


125263 31-Jan-2004 obrien

Bump the NFCv3/TCP defaults for rsize and wsize from 8K to 32K to match
Solaris and HP-UX. This increases read performance for large files across NFS.

PR: 62024 & 26324
Submitted by: Bjoern Groenvall <bg@sics.se>


122953 22-Nov-2003 alfred

Use function pointers to remove the depenancy cross dependancy on nfs4
and the nfs3 client. Also fix some bugs that happen to be causing crashes
in both v3 and v4 introduced by the v4 import.

Submitted by: Jim Rees <rees@umich.edu>
Approved by: re


122736 15-Nov-2003 alfred

Move the declaration for "struct nfs4_fctx" out from under #ifdef KERNEL
for fstat(1).


122719 15-Nov-2003 alfred

unbreak LINT.


122698 14-Nov-2003 alfred

University of Michigan's Citi NFSv4 kernel client code.

Submitted by: Jim Rees <rees@umich.edu>


122523 12-Nov-2003 kan

1. Consolidate mount struct allocation/destruction into a common code in
vfs_mount_alloc/vfs_mount_destroy functions and take care to completely
destroy the mount point along with its locks. Mount struct has grown in
coplexity recently and depending on each failure path to destroy it
completely isn't working anymore.

2. Eliminate largely identical vfs_mount and vfs_unmount question by
moving the code to handle both cases into a newly introduced vfs_domount
function.

3. Simplify nfs_mount_diskless to always expect an allocated mount
struct and never attempt an allocation/destruction itself. The
vfs_allocroot allocation was there to support 'magic' swap space
configuration for diskless clients that was already removed by PHK some
time ago.

4. Include a vfs_buildopts cleanups by Peter Edwards to validate the
sanity of nmount parameters passed from userland.

Submitted by: (4) Peter Edwards <peter.edwards@openet-telecom.com>
Reviewed by: rwatson


122450 11-Nov-2003 alfred

Stop using shared locks for nfs vop locks.

The reason this was done was to avoid a race to the root when an
NFS server went down. However a semi-recent change to the way that
the kernel's lookup() routine traverses mount points prevents this.

Rev 1.39 of vfs_lookup.c changed the ordering of locks such that we
aquire a shared lock on the mount point being accessed and then drop
the directory vnode lock before requesting the target lock.

With that in place we no longer need shared locks for NFS to prevent
race to the root lockups.


122261 07-Nov-2003 sam

Assert GIANT_REQUIRED where sockets are manipulated. This is
preparatory for MPSAFE network commits and ongoing socket
locking work.

Supported by: FreeBSD Foundation


122091 05-Nov-2003 kan

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


121874 02-Nov-2003 kan

Take care not to call vput if thread used in corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.

This fixes the panic on shutdown people have reported.


121816 31-Oct-2003 brooks

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


121205 18-Oct-2003 phk

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


121201 18-Oct-2003 phk

Initialize bp->b_offset before calling VOP_STRATEGY().

Remove KASSERTS and panics with B_PHYS checks which no longer apply.


121191 18-Oct-2003 phk

We do not get B_PHYS buffers here anymore. /dev/drum is long gone.


120812 05-Oct-2003 iedowse

Since the addition of the VI_DOINGINACT flag some time ago,
VOP_INACTIVE routines need not worry about their vnode getting
recycled if they block. Remove the code from nfs_inactive() that
used vget() to get an extra vnode reference that was held during
the nfs_vinvalbuf() call.


120788 05-Oct-2003 jeff

- Remove an incorrect XXX comment. This code does respect the XLOCK since
it uses vget() which will fail if the identity changes.


120787 05-Oct-2003 jeff

- Check the XLOCK before we inspect the vnode.


120786 05-Oct-2003 jeff

- We don't need to cache_purge() in nfs_reclaim(), vclean() does it for us.


120755 04-Oct-2003 jeff

- Consistently set sopt_dir.

Pointed out by: pete@isilon.com


120736 04-Oct-2003 jeff

- Acquire the vnode interlock prior to dropping the mntvnode_mtx.
- Make a note of the lack of XLOCK protection in this code. We would access
a vnode while it is changing identities without Giant.


120730 04-Oct-2003 jeff

- Remove the backtrace() call from the *_vinvalbuf() functions. Thanks to a
stack trace supplied by phk, I now understand what's going on here. The
check for VI_XLOCK stops us from calling vinvalbuf once the vnode has been
partially torn down in vclean(). It is not clear that this would cause
a problem. Document this in nfs_bio.c, which is where the other two
filesystems copied this code from.


120264 19-Sep-2003 jeff

- Remove interlock protection around VI_XLOCK. The interlock is not
sufficient to guarantee that this race is not hit. The XLOCK will likely
have to be redesigned due to the way reference counting and mutexes work
in FreeBSD. We currently can not be guaranteed that xlock was not set
and cleared while we were blocked on the interlock while waiting to check
for XLOCK. This would lead us to reference a vnode which was not the
vnode we requested.
- Add a backtrace() call inside of INVARIANTS in the hopes of finding out if
this condition is ever hit. It should not, since we should be retaining
a reference to the vnode in these cases. The reference would be sufficient
to block recycling.


120003 12-Sep-2003 phk

Name the vnode method vectors consistently with the rest of the filesystems.

This improves the output of src/tools/tools/vop_table


119766 05-Sep-2003 phk

Remove now unused BOOTP tags related to NFS swap device.


119735 04-Sep-2003 dds

KNF: parentheses around return values.

Suggested by: bde
Approved by: schweikh (mentor - blanket)
MFC after: 6 weeks


119687 02-Sep-2003 dds

Fix errno return values to better represent failure reasons for
read and open.

Approved by: schweikh (mentor)
Agreed: bde
MFC after: 6 weeks


118944 15-Aug-2003 phk

Remove the magic way of configuring NFS backed swap.

This code dates back to the very first diskless support on FreeBSD,
back when swapon(8) couldn't simply be run on a NFS backed file.

Suggested replacement command sequence on the client:

dd if=/dev/zero of=/swapfile bs=1k count=1 oseek=100000
swapon /swapfile
rm -f /swapfile

For whatever value of 100000 you want.


118639 07-Aug-2003 billf

0) preallocate per-interface context structures without the ifnet lock held
1) avoid immediately calling bzero() after malloc() by passing M_ZERO
2) do not initialize individual members of the global context to zero
3) remove an unused assignment of ifctx in bootpc_init()

Reviewed by: tegge


118135 29-Jul-2003 tjr

Fix a problem that occurs when truncating files on NFSv3 mounts: we need
to set np->n_size back to the desired size again after calling
nfs_meta_setsize(), since it could end up in nfs_loadattrcache() getting
called, which would change n_size back to the value it had before the
truncate request was issued. The result of this bug is that the size info
cached in the nfsnode becomes incorrect, lseek(fd, ofs, SEEK_END) seeks
past the end of the file, stat() returns the wrong size, etc.

PR: 41792
MFC after: 2 weeks


118094 27-Jul-2003 phk

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


117152 02-Jul-2003 phk

Change idle sleep indentifier to "-" for nfsiod


116461 17-Jun-2003 alc

Lock the vm object when freeing a page.


116412 15-Jun-2003 phk

Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations
to check that the buffer points to the correct vnode.


116271 12-Jun-2003 phk

Initialize struct vfsops C99-sparsely.

Submitted by: hmp
Reviewed by: phk


116260 12-Jun-2003 iedowse

When removing a sillyrename file, make sure that the directory vnode
has not been cleaned in the meantime, since this can happen during
a forced unmount. Also add a comment that nfs_removeit() should
really be locking the directory vnode before calling nfs_removerpc().

Reported by: mbr
Tested by: mbr
MFC after: 1 week


116189 11-Jun-2003 obrien

Use __FBSDID().


116185 11-Jun-2003 rwatson

Add the comment I meant to add about not passing in PCATCH to the
tsleep(). Note the XXX.


116070 09-Jun-2003 hsu

On a socket creation error, don't close the socket.


115533 31-May-2003 phk

Remove unsed variables.
Add explicit breaks to switch

Found by: FlexeLint


115456 31-May-2003 phk

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


115415 30-May-2003 rwatson

rpc.lockd stability workaround: remove PCATCH from the tsleep() in
nfs_lock.c. Right now, if we permit a signal to interrupt the sleep,
we will slip the lock and no process on that client, the server, or
any other client will be able to acquire the lock. This can happen,
for example, if a user hits Ctrl-C or Ctrl-T while a process is
waiting for the lock. By removing PCATCH, we prevent that from
happening, at the cost of not permitting a user-requested lock abort:
also nasty. However, a user interface bug might be preferable to a
serious semantic bug, so we go with that for now.

We need to teach the rpc.lockd/kernel protocol how to abort lock
requests, and rpc.lockd how to handle aborted lock requests; patches
for the kernel bit are floating around, but no rpc.lockd bit yet.

Approved by: re (scottl)


115172 19-May-2003 peter

Deal with the possibility of negative available space from the file server
to avoid Bad Things(TM) happening (eg: df crashing with a floating point
exception).

Submitted by: Harold Gutch <logix@foobar.franken.de>
Approved by: re (scottl)


115041 15-May-2003 rwatson

This change grabs the vnode lock for NFS client vnodes when calling
VOP_SETATTR() or VOP_GETATTR(); without these locks (a) VFS_DEBUG_LOCKS
will panic, and (b) it may be possible to corrupt entries in the cached
vnode attributes in the nfsnode, since nfsnode attribute cache data is
also protected by the vnode lock.

Approved by: re (jhb)
Pointed out by: VFS_DEBUG_LOCKS


114983 13-May-2003 jhb

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


114434 01-May-2003 des

Instead of recording the Unix time in a process when it starts, record the
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.


114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


113986 24-Apr-2003 truckman

VOP_FSYNC() expects to be called with the vnode locked, so lock fvp in
nfs_rename() before calling VOP_FSYNC() and unlock fvp immediately after.

Reviewed by: bde


113985 24-Apr-2003 peter

Fix a bug with df on large (>1TB) nfsv3 file servers on 32 bit client
machines where the 'long' number of blocks in struct statfs wont fit.
Instead of chosing an artificial 512 byte block size, simply scale it up
until we avoid an overflow. NFSv3 reports the sizes in bytes, and the
blocksize is a figment of nfsclient's imagination.


113885 23-Apr-2003 truckman

Release the vnode interlock in nfs_flush() before calling nfs_sigintr(),
and grab it again later if necessary. This prevents a lock order reversal
because nfs_sigintr() calls PROC_LOCK().


112892 31-Mar-2003 thomas

Revert change 1.201 (removing mapping of VAPPEND to VWRITE).
Instead, use the generic vaccess() operation to determine whether
an operation is permitted. This avoids embedding knowledge on
vnode permission bits such as VAPPEND in the NFS client.

PR: kern/46515
vaccess() patch submitted by: "Peter Edwards" <pmedwards@eircom.net>
Approved by: tjr, roberto (mentor)


112888 31-Mar-2003 jeff

- Move p->p_sigmask to td->td_sigmask. Signal masks will be per thread with
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.


112685 26-Mar-2003 rwatson

Add O_NONBLOCK to the vn_open_cred() flags for NFS client locking when
opening the POSIX fifo; convert ENXIO error returns to EOPNOTSUPP.

This improves handling of the case where the /var/run/lock fifo exists
but there is no listener: we immediately return EOPNOTSUPP rather
than blocking until a listener turns up. This could occur during a
diskless boot before rpc.lockd is loaded, or if the lock file persists
across a reboot following the disabling of rpc.lockd. This may have
suddenly started to occur due to fifo blocking fixes--previously it
looks like attempts to read on a fifo with no listener would time out
due to insufficient resources.

Reviewed by: alfred


112657 26-Mar-2003 alfred

req can not be NULL or we'd die.

Sponsored by: RED


112455 21-Mar-2003 tjr

Map VAPPEND to VWRITE in nfsspec_access() - VAPPEND is never set in the
mode returned by VOP_GETATTR. This fixes incorrect "Permission denied"
errors when trying to append to a file on an NFSv2 mount.


112225 14-Mar-2003 jeff

- Add a forgotten BUF_LOCK()

Most sincere apologies to: jake


112178 13-Mar-2003 jeff

- Lock the buf before inspecting its contents.


111856 04-Mar-2003 jeff

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


111841 03-Mar-2003 njl

Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.


111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


111514 26-Feb-2003 jeff

- The interlock was not being droped in nfs_flush() if the first part of
an if clause was true. Break the two clauses out into seperate statements
since they require different actions.

Reported/Tested by: jake
Spotted by: jhb


111475 25-Feb-2003 jeff

- Properly handle the vnode interlock in nfs_fsync.

Reported by: phk


111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


111105 18-Feb-2003 peter

Get rid of a silly message I added back in Sept 2001 (1.68).


110922 15-Feb-2003 tjr

Lock proc while accessing p_siglist, p_sigmask and p_sigignore
in nfs_sigintr().


109704 22-Jan-2003 dillon

Provide a sysctl to allow defaulting of the connectionless (-c) feature
to mount_nfs. The sysctl defaults to 1 (paranoid mode). Setting it to 0
will allow an NFS client to receive replies on a different IP then they
were sent to by default.

Submitted by: Sean Eric Fagan <sef@kithrup.com>


109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


108648 04-Jan-2003 phk

Since Jeffr made the std* functions the default in rev 1.63 of
kern/vfs_defaults.c it is wrong for the individual filesystems to use
the std* functions as that prevents override of the default.

Found by: src/tools/tools/vop_table


108589 03-Jan-2003 phk

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


108470 30-Dec-2002 schweikh

Fix typos, mostly s/ an / a / where appropriate and a few s/an/and/
Add FreeBSD Id tag where missing.


108357 28-Dec-2002 dillon

Abstract-out the constants for the sequential heuristic.

No operational changes.

MFC after: 1 day


108250 24-Dec-2002 hsu

SMP locking for radix nodes.


108198 23-Dec-2002 alc

Avoid holding the vnode interlock around malloc() or free() to prevent a
lock order reversal.

Reviewed by: jeff


108172 22-Dec-2002 hsu

SMP locking for ifnet list.


108161 21-Dec-2002 dillon

do not try to free a mountpoint that we did not allocate.

X-MFC after: immediately


107104 20-Nov-2002 alfred

reapply 1.26 through 1.28.

Approved by: re


107101 20-Nov-2002 alfred

forgot about 5.x freeze, backout 1.26 through 1.28 pending re@ appoval.


107100 20-Nov-2002 alfred

remove useless casts, unused macros and cleanup a line wrap.


107099 20-Nov-2002 alfred

comment and untwist error return logic


107098 20-Nov-2002 alfred

Remove an outdated comment complaining about exporting struct ucred
to userspace, I fixed it a while ago.


105568 20-Oct-2002 phk

Don't examine an un-initialized variable.

Spotted by: FlexeLint.


105563 20-Oct-2002 phk

Remove extern declarations of stuff which is static in nfs_node.c
Move related macro to nfs_node.c

Spotted by: FlexeLint


105077 14-Oct-2002 mckusick

Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.


104908 11-Oct-2002 mike

Change iov_base's type from `char *' to the standard `void *'. All
uses of iov_base which assume its type is `char *' (in order to do
pointer arithmetic) have been updated to cast iov_base to `char *'.


104354 02-Oct-2002 scottl

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


104306 01-Oct-2002 jmallett

Back our kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


104241 30-Sep-2002 jmallett

Lock access to the signal queue, and related structures, with PROC_LOCK.

Submitted by: jhb


104235 30-Sep-2002 jmallett

Convert use of p_siglist and old SIG*() macros to use <sys/ksiginfo.h>
prototyped functions to get a sigset_t, and further to check for any
queued signals, rather than an empty signal set, to go with the move
to signal queues rather than signal sets.


104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


104024 27-Sep-2002 rwatson

Remove an errant debugging printf that got left in during my last
commit.

Pointed out by: guido


104016 26-Sep-2002 rwatson

Apparently pxeboot passes in a mygateway of non-zero sin length
from DHCP in the event that no gateway is returned from DHCP, breaking
the assumption that we skip the routing insertion of the gateway
if the sin length is zero. Check also for s_addr of 0 to avoid the
"Oh no, adding my default route failed" panic, making it possible
to pxeboot machines on segments without default routes. Arguably
this could be a bug in pxeboot, or in the TUNABLE code, but this
makes my boxes boot.


103939 25-Sep-2002 jeff

- Lock access to the buf lists.
- Use vrefcnt() where appropriate.
- Add some locking asserts.


103770 22-Sep-2002 jake

Moved nfs_diskless setup code from autoconf.c to nfsclient/nfs_diskless.c
so that it is MI. Allow nfs_mountroot to return an error if the nfs_diskless
struct is not valid, rather than panicing later on. Call nfs_setup_diskless()
from nfs_mountroot if NFS_ROOT is defined, like bootpc_init(). Removed legacy
root mount support for sparc64, and enabled NFS_ROOT by default.


103554 18-Sep-2002 phk

Use m_length() instead of home-rolled versions.


103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


103099 08-Sep-2002 phk

Now that we have a cached mount credential in struct mount, use it istead
of a private cached copy.


102966 05-Sep-2002 bde

Use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.


102052 18-Aug-2002 sobomax

Increase size of ifnet.if_flags from 16 bits (short) to 32 bits (int). To avoid
breaking application ABI use unused ifreq.ifru_flags[1] for upper 16 bits in
SIOCSIFFLAGS and SIOCGIFFLAGS ioctl's.

Reviewed by: -hackers, -net


101947 15-Aug-2002 alfred

Remove a case of exposing 'struct ucred' to userspace. Use a struct xucred
for LOCKD_MSG instead.

Requested by: rwatson


101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101777 13-Aug-2002 phk

Introduce typedefs for the member functions of struct vfsops and employ
these in the main filesystems. This does not change the resulting code
but makes the source a little bit more grepable.

Sponsored by: DARPA and NAI Labs.


101744 12-Aug-2002 rwatson

Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent
enforcement of MAC policy on the read or write operations:

- In ext2fs, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), directory modifications in
rename(), directory write operations in mkdir(), symlink write
operations in symlink().

- In the NFS client locking code, perform vn_rdwr() on the NFS locking
socket without enforcing MAC, since the write is done on behalf of
the kernel NFS implementation rather than the user process.

- In UFS, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), and symlink write operations
in symlink().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101364 05-Aug-2002 jeff

- Add a missing VI_UNLOCK to an error case in nfs_flush.


101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


100450 21-Jul-2002 alc

o Lock page queue accesses in nfs_getpages().


100194 16-Jul-2002 dillon

Fix a bug nfs_write() related to ^C'ing during a file write on an
interruptable mount. We were returning from inside the loop without
releasing the rslock.

Submitted by: Mike Junk <junk@isilon.com>
MFC after: 3 days


100175 16-Jul-2002 jhb

If we get a receive error in nfs_receive() and then get an error trying to
obtain the send lock, we would bogusly try to unlock the send lock before
returning resulting in a panic. Instead, only unlock the send lock if
nfs_sndlock() succeeds and nfs_reconnect() fails.

MFC after: 3 days
Sponsored by: The Weather Channel


100134 15-Jul-2002 alfred

Add IPv6 support.

Submitted by: Jean-Luc Richier <Jean-Luc.Richier@imag.fr>


99797 11-Jul-2002 dillon

Convert old style (type foo *)0 casts to NULLs

PR: kern/40360
Requested by: Hiten PAndya via direct email


99737 10-Jul-2002 dillon

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


98988 28-Jun-2002 jhb

In namei(), we use a NULL thread for uio_td when doing a VOP_READLINK().
nfs_readlink() calls nfs_bioread() which passes in uio_td as the thread
argument to nfs_getcacheblk(). In nfs_getcacheblk() we dereference the
thread pointer to get a process pointer to pass to nfs_sigintr(). This
obviously results in a panic. :)

Rather than change nfs_getcacheblk() to check if the thread pointer is
NULL when calling nfs_sigintr() like other callers do, change
nfs_sigintr() to take a thread as the last argument instead of a
process so none of the callers have to care if the thread is NULL or not.


97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


97333 27-May-2002 dd

Don't tsleep() with an sb_mtx held.


97211 24-May-2002 peter

Fix warning; deprecated use of label at end of compound statement


96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


96825 17-May-2002 ambrisko

Add TAG_VENDOR_INDENTIFIER (option 60) to our DHCP request done by the
kernel BOOTP option. The format will be:
FreeBSD:<MACHINE>:<osrelease>
this way people can tune their DHCP server to server up root file systems
via the OS, machine type and version.

Obtained from: NetBSD
MFC after: 3 weeks


96755 16-May-2002 trhodes

More s/file system/filesystem/g


95662 28-Apr-2002 phk

We don't need the arp kludge any more.


95590 27-Apr-2002 iedowse

Remove the nfs_{lock,unlock,islocked} functions and the associated
definitions; they have been unused and #if 0'd out since the Lite/2
merge and we are unlikely to want them in the future.


94903 17-Apr-2002 iedowse

The recent NFS forced unmount improvements introduced a side-effect
where some client operations might be unexpectedly cancelled during
an unsuccessful non-forced unmount attempt. This causes problems
for amd(8), because it periodically attempts a non-forced unmount
to check if the filesystem is still in use.

Fix this by adding a new mountpoint flag MNTK_UNMOUNTF that is set
only during the operation of a forced unmount. Use this instead of
MNTK_UNMOUNT to trigger the cancellation of hung NFS operations.

Also correct a problem where dounmount() might inadvertently clear
the MNTK_UNMOUNT flag.

Reported by: simokawa
MFC after: 1 week


93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


92783 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.


92219 13-Mar-2002 luigi

Add a readonly sysctl variable of type string, kern.bootp_cookie,
which is initialized with whatever string a dhcp/bootp server passes
as vendor tag 134.
There is no standard tag that I know with this information, and
no vendor-defined tag that applies to FreeBSD that I could find
doing the same thing.

The intended use is to pass information to userland for run-time
configuration of a diskless client without having to run a bootp/dhcp
client for the third time (after the one in pxeboot/etherboot, and
the one in the kernel bootp), also because these clients generally
screwup the interface configuration, which is not exactly what you
want when you have your disks nfs-mounted.

Manpage update to follow soon.

MFC-after: 3 days


91888 08-Mar-2002 phk

vhold() our vnode while checking the remote side.

This is belived to be the only place where a soft reference to a vnode
is held with no sort of hard reference, consequently this change should
allow us to free(9) vnodes from the freelist after properly cleaning
them up.

Reviewed by: dillon


91460 28-Feb-2002 peter

Fix warnings.. bootpc_init() and related.


91420 27-Feb-2002 jhb

Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.


91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


90373 07-Feb-2002 peter

Fix a long line touched in previous commit (but not caused by previous
commit)


90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


89407 15-Jan-2002 peter

Revise the nfsiod auto tuning code. Now both the upper and lower limits
are specifyable by sysctl and are respected.

Submitted by: Maxime Henrion <mux@sneakerz.org>


89324 14-Jan-2002 peter

Implement vfs.nfs.iodmin (minimum number of nfsiod's) and
vfs.nfs.iodmaxidle (idle time before nfsiod's exit). Make it adaptive
so that we create nfsiod's on demand and they go away after not being
used for a while. The upper limit is NFS_MAXASYNCDAEMON (currently 20).
More will be done here, but this is a useful checkpoint.

Submitted by: Maxime Henrion <mux@qualys.com>


89174 10-Jan-2002 iedowse

Terminate requests in nfs_sigintr() if the filesystem is in the
process of being unmounted. This allows forced NFS unmounts to
complete even if there are processes stuck holding the mnt_lock
while the server is down. The mechanism is not ideal in that there
is a small chance we might accidentally cancel requests during a
failed non-forced unmount attempt on that filesystem, but this
is not really a big problem.

Also, move the tsleep() in nfs_nmcancelreqs() so that we do not
sleep in the case where there are no requests to be cancelled.


88796 02-Jan-2002 iedowse

Permit NFS filesystems to be forcibly unmounted when the server is
down, even if there are hung processes and the mount is non-
interruptible.

This works by having nfs_unmount call a new function nfs_nmcancelreqs()
in the FORCECLOSE case. It scans the list of outstanding requests
and marks as interrupted any requests belonging to the specified
mount. Then it waits up to 30 seconds for all requests to terminate.
A few other changes are necessary to support this:
- Unconditionally set a socket timeout so that even hard mounts
are guaranteed to occasionally check the R_SOFTTERM flag on
requests. For hard mounts this flag can only be set by
nfs_nmcancelreqs().
- Reject requests on a mount that is currently being unmounted.
- Never grant the receive lock to a request that has been cancelled.

This should also avoid an old problem where a forced NFS unmount
could cause a crash; it occurred when a VOP on an unlocked vnode
(usually VOP_GETATTR) was in progress at the time of the forced
unmount.


88777 01-Jan-2002 alc

o Remove an errant ';' introduced in the last revision.
o Remove an unused variable.


88770 01-Jan-2002 rwatson

o Remove premature use of nmp->nm_cred, it hasn't been initialized yet.


88746 31-Dec-2001 rwatson

o Pass td into nfs_mountroot() to eliminate an XXX'd curthread use.
Since it's in the parent function anyway, might as well pass it
another layer down.

Obtained from: TrustedBSD Project


88745 31-Dec-2001 rwatson

o Remove premature leakage of use of td_ucred from base source tree:
instead, use td->td_proc->p_ucred.


88743 31-Dec-2001 rwatson

o Add missing #include's of sys/proc.h, missed in merge, required to
dereference td->td_proc->p_ucred.


88739 31-Dec-2001 rwatson

o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
argument.

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch


88712 30-Dec-2001 iedowse

Add a #define for the size of the nfs_backoff[] array, and use this
instead of magic constants in the code.


88680 30-Dec-2001 ambrisko

Increase the buffer size to hold a bootp/DHCP reply from 256 bytes to
1222 bytes (derived as the maximum that isc-dhcpd uses). This solves
the problem if a bootp/DHCP reply is over 256 bytes in which the
end of the bootp/DHCP reply will not be found and then the reply will
be ignored. This happens when swap and root paths are longish or many
parameters are set.

Reviewed by: imp
Approved by: imp


88541 27-Dec-2001 dillon

nfs_nget() does no locking whatsoever when looking up a vnode. If the
vget() sleeps we have to retry the operation to avoid racing against
a deletion.

MFC maybe: submitted to re's


88091 18-Dec-2001 iedowse

Avoid passing the variable `tl' to functions that just use it for
temporary storage. In the old NFS code it wasn't at all clear if
the value of `tl' was used across or after macro calls, but I'm
fairly confident that the convention was to keep its use local.
Each ex-macro function now uses a local version of this variable,
so all of the double-indirection goes away.

The only exception to the `local use' rule for `tl' is nfsm_clget(),
which is left unchanged by this commit.

Reviewed by: peter


87834 14-Dec-2001 dillon

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flusing a dirty buffer due to B_CACHE getting cleared,
we were accidently setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentitive, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week


86363 14-Nov-2001 rwatson

o Modify nfslockdans() to accept a thread reference instead of a proc
reference: with td->td_ucred, it will be desirable to authorize
based on td->td_ucred, rather than p->p_ucred.
o Since the same variable 'p' was later used with pfind() on the target
process for the wakeup, introduce a new local variable 'targetp'
to use instead.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


86284 12-Nov-2001 alfred

Allow users to use the 'nolockd' or -L options with mount_nfs in order
to avoid the need for rpc.lockd to perform client locks. Using
this option a user can revert back to using local locks for NFS mounts
like we did before we had rpc.lockd.


86278 11-Nov-2001 alfred

turn vn_open() into a wrapper around vn_open_cred() which allows
one to perform a vn_open using temporary/other/fake credentials.

Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.


86089 05-Nov-2001 dillon

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


85398 24-Oct-2001 rwatson

o Note an additional potential problem here: LOCKD_MSG directly exports
struct ucred to userland. In 5.0-CURRENT, it is desirable to instead
export struct xucred, as ucred contains mutexes, pointers, and other
kernel evil. I'll add it to my work queue.


85370 23-Oct-2001 rwatson

o Add two comments identifying problems with the current nfs_lock.c
implementation, so that the information doesn't get lost.
(1) /var/run/lock is looked up relative to the current thread's root
directory, but it's not clear that's desirable.
(2) A race condition associated with live credential modification on
a shared credential is present when privilege is granted for
the purposes of talking to /var/run/lock.


85339 23-Oct-2001 dillon

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


84726 09-Oct-2001 jhb

Use crhold() instead of crdup() since we aren't modifying the cred but
just need to ensure it remains immutable.


84700 09-Oct-2001 peter

Make this compile after last commit. It should be:
"td ? td->td_proc : NULL", not "td ? td->td_proc, NULL"


84690 08-Oct-2001 julian

Don't dereference td if it's NULL.

Submitted by: Alexander N. Kabaev <ak03@gte.com>


84079 28-Sep-2001 peter

Unwind some more macros. NFSMADV() was kinda silly since it was right
next to equivalent m_len adjustments. Move the nfsm_subs.h macros
into groups depending on which phase they are used in, since that
affects the error recovery requirements. Collect some of the common error
checking into a single macro as preparation for unwinding some more.
Have nfs_rephead return a value instead of secretly modifying args.
Remove some unused function arguments that were being passed around.
Clarify nfsm_reply()'s error handling (I hope).


84057 27-Sep-2001 peter

Make nfsm_dissect() have an obvious return value.


84002 27-Sep-2001 peter

Tidy up nfsm_build usage. This is only partially finished.


83914 25-Sep-2001 iedowse

Add a missing dereference level. This caused nfsm_postop_attr_xx()
to try and extract node attributes from an RPC reply even if none
were present.

Reviewed by: peter


83697 20-Sep-2001 peter

Add the magic marker so that loader and kldload(2) can find this in
module form automagically.


83694 20-Sep-2001 peter

Oops. Fix a missing indirection level. gcc didn't complain about it on
x86, but did complain about it on alpha (since int and pointer are
different sizes)


83654 18-Sep-2001 peter

Sigh, Last minute pre-merge typo. (missing quotes)


83651 18-Sep-2001 peter

Cleanup and split of nfs client and server code.
This builds on the top of several repo-copies.


83629 18-Sep-2001 imp

nfs_strategy calls nfs_asyncio with td as NULL. So add a bandaid that
will pass NULL as the struct proc when td is NULL. This has stopped
crashing on my machine.

Note: The passing of NULL may be bogus, but I'll let others fix that
problem.

Reviewed by: jhb


83493 15-Sep-2001 peter

Sync some differences that were different between the copies of the files
that were in nfs/nfs.h and nfsserver/nfs.h in the p4 tree.


83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


83291 10-Sep-2001 kris

Fix some signed/unsigned integer confusion, and add bounds checking of
arguments to some functions.

Obtained from: NetBSD
Reviewed by: peter
MFC after: 2 weeks


82702 31-Aug-2001 dillon

Pushdown Giant for nfs syscalls (nfssvc())


82213 23-Aug-2001 ache

Stupid error from my side in prev. commit: || -> &&


82204 23-Aug-2001 ache

Implement l_len<0 per POSIX check.
Check for valid l_whence too.


82194 23-Aug-2001 ache

Even better move: suppose that server is able to handle SEEK_END,
so check arguments for all but not SEEK_END case, leaving SEEK_END
handling for server


82193 23-Aug-2001 ache

Apparently SEEK_END locking not supported by NFS. Previous variant
returns EINVAL in that case, change it to EOPNOTSUPP.


82190 23-Aug-2001 ache

Move <machine/*> after <sys/*>

Pointed by: bde


82174 23-Aug-2001 ache

adv. lock:
detect off_t overflow _before_ it occurse and return EOVERFLOW instead of
EINVAL


80891 01-Aug-2001 iedowse

Fix a client-side memory leak in nfs_flush(). The code allocates
a temporary array to store struct buf pointers if the list doesn't
fit in a local array. Usually it frees the array when finished,
but if it jumps to the 'again' label and the new list does fit in
the local array then it can forget to free a previously malloc'd
M_TEMP memory.

Move the free() up a line so that it frees any previously allocated
memory whether or not it needs to malloc a new array.

Reviewed by: dillon


80672 30-Jul-2001 peter

Check the filehandle size when mounting.

Obtained from: Constantine Sapuntzakis <csapuntz@openbsd.org>


79247 04-Jul-2001 jhb

- Sort includes.
- Update vmmeter statistics for vnode pagein/pageouts in getpages/putpages.


79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


78911 28-Jun-2001 jhb

- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.


77563 01-Jun-2001 jake

Unlock the process returned from pfind() if it does not return NULL.
This fixes a witness lock violation for nfssvc returning with locks
held.

Submitted by: Jean-Luc Richier <Jean-Luc.Richier@imag.fr>
PR: kern/27776


77183 25-May-2001 rwatson

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


77086 23-May-2001 jhb

Assert Giant is held by the caller rather than getting it and releasing
it in getpages/putpages.


77031 23-May-2001 ru

- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
Makefiles.


76827 19-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


76688 16-May-2001 iedowse

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


76131 29-Apr-2001 phk

Add a vop_stdbmap(), and make it part of the default vop vector.

Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.


76118 29-Apr-2001 alfred

Remove incorrect comment.

Submitted by: quinot@inf.enst.fr <quinot@inf.enst.fr>
PR: kern/26893


76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


75692 19-Apr-2001 alfred

vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()


75631 17-Apr-2001 alfred

Implement client side NFS locks.

Obtained from: BSD/os
Import Ok'd by: mckusick, jkh, motd on builder.freebsd.org


75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


75402 11-Apr-2001 peter

Create debug.hashstat.[raw]nchash and debug.hashstat.[raw]nfsnode to
enable easy access to the hash chain stats. The raw prefixed versions
dump an integer array to userland with the chain lengths. This cheats
and calls it an array of 'struct int' rather than 'int' or sysctl -a
faithfully dumps out the 128K array on an average machine. The non-raw
versions return 4 integers: count, number of chains used, maximum chain
length, and percentage utilization (fixed point, multiplied by 100).
The raw forms are more useful for analyzing the hash distribution, while
the other form can be read easily by humans and stats loggers.


75218 05-Apr-2001 rwatson

o Rather than arbitrarily construct a credential in the nfs_statfs()
VFS operation, make use of the calling process's credential. This
solution may not be ideal (there are a number of other possible
proposals, including making use of the proc0 credential, adding a
credential argument to the VFSOP, and switching from a hard-coded
ucred to a hard-coded nfscred), it is simple and appears to
work. The arguments against using simply crget() are fairly
strong: it is the only place in the code (other than a nearly
identical invocation in ncp) where crget() is invoked, other than
in the process credential creation code; as ucred becomes extensible,
this use of crget() without appropriate context results in less and
less meaningful credential data. The implementation here will
probably be tweaked as a result of experimentation and further
exploration of the requirements. In the mean-time, it allows
progress to be made in ucred expansion for new security models without
causing a crash every time df is used on an NFS mounted file system.

This code has been interop tested against FreeBSD and Solaris NFS
servers. While using the process credentials should not introduce
interop problems, please let me know if any turn out to exist.

Reviewed by: freebsd-arch


74501 20-Mar-2001 peter

Use the same API as the example code.
Allow the initial hash value to be passed in, as the examples do.
Incrementally hash in the dvp->v_id (using the official api) rather than
add it. This seems to help power-of-two predictable filename trees
where the filenames repeat on a power-of-two cycle and the directory trees
have power-of-two components in it. The simple add then mask was causing
things like 12000+ entry collision chains while most other entries have
between 0 and 3 entries each. This way seems to improve things.


74384 17-Mar-2001 peter

Use a generic implementation of the Fowler/Noll/Vo hash (FNV hash).
Make the name cache hash as well as the nfsnode hash use it.

As a special tweak, create an unsigned version of register_t. This allows
us to use a special tweak for the 64 bit versions that significantly
speeds up the i386 version (ie: int64 XOR int64 is slower than int64
XOR int32).

The code layout is a little strange for the string function, but I was
able to get between 5 to 10% improvement over the original version I
started with. The layout affects gcc code generation choices and this way
was fastest on x86 and alpha.

Note that 'CPUTYPE=p3' etc makes a fair difference to this. It is
around 45% faster with -march=pentiumpro on a p6 cpu.


74381 17-Mar-2001 peter

Dramatically improve the **lame** nfs_hash(). This is based on the
Fowler / Noll / Vo Hash (http://www.isthe.com/chongo/tech/comp/fnv/).

This improves hash coverage a *massive* amount. We were seeing one
set of machines that were using 0.84% of their 131072 entry nfsnode
hash buckets with maximum chain lengths of up to ~500 entries. The
machine was spending nearly 100% of its time in 'system'.
A test with this has pushed the coverage from a few perCent up to 91%
utilization with a max chain length of 11.

Submitted by: David Filo


73929 07-Mar-2001 jhb

Grab the process lock while calling psignal and before calling psignal.


73286 01-Mar-2001 adrian

Reviewed by: jlemon

An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
kernel variables in vfs_mount(). `data' remains a pointer into
userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
aren't too long.

* rework mount() and linux_mount() to take the userland parameters
(besides data, as mentioned) and pass kernel variables to vfs_mount().
(linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
filesystem. This variable is generally initialised with `path', and
each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.


73211 28-Feb-2001 dillon

Fix lockup for loopback NFS mounts. The pipelined I/O limitations could be
hit on the client side and prevent the server side from retiring writes.
Pipeline operations turned off for all READs (no big loss since reads are
usually synchronous) and for NFS writes, and left on for the default bwrite().
(MFC expected prior to 4.3 freeze)

Testing by: mjacob, dillon


72650 18-Feb-2001 green

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


71915 02-Feb-2001 tegge

Enable use of DHCP extensions.
Reviewed by: Per Kristian Hove <Per.Hove@math.ntnu.no>


70674 04-Jan-2001 dillon

NFS O_EXCL file create semantics temporarily uses file attributes to store
the file verifier. The NFS client is supposed to do a SETATTR after a
successful O_EXCL open/create to clean up the attributes. FreeBSD's
client code was generating a SETATTR rpc but was not generating an access
or modification time update within that rpc, leaving the file with a
broken access time that solaris chokes on (and it doesn't look very
nice when you ls -lua under FreeBSD either!). Fixed.


70254 21-Dec-2000 bmilekic

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.


69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


69214 26-Nov-2000 phk

Simplify the tprintf() API.

Loose the special <sys/tprintf.h> #include file.


68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


68711 14-Nov-2000 mckusick

In preparation for deprecating CIRCLEQ macros in favor of TAILQ
macros which provide the same functionality and are a bit more
efficient, convert use of CIRCLEQ's in NFS to TAILQ's.


68186 01-Nov-2000 eivind

Give vop_mmap an untimely death. The opportunity to give it a timely
death timed out in 1996.


67882 29-Oct-2000 phk

Remove unneeded #include <sys/proc.h> lines.


67834 29-Oct-2000 tegge

Reduce kernel stack usage by not having large packets on the stack.
Supply correct size parameter to dhcpd.
Replace some magic numbers with macro names.
Handle more than one interface.


67534 24-Oct-2000 tegge

Eliminate some bitrot (nonexisting member variable names).
Don't use curproc when a proc pointer is available.


67531 24-Oct-2000 tegge

Style fixes.


67529 24-Oct-2000 tegge

Make RPC timeout message more readable.
Supply proc pointer to sosend.


67486 24-Oct-2000 dwmalone

Problem to avoid processes getting stuck in "vmopar". From Ian's
mail:

The problem seems to originate with NFS's postop_attr
information that is returned with a read or write RPC.
Within a vm_fault context, the code cannot deal with
vnode_pager_setsize() shrinking a vnode.

The workaround in the patch below stops the nfsm_postop_attr()
macro from ever shrinking a vnode. If the new size in the
postop_attr information is smaller, then it just sets the
nfsnode n_attrstamp to 0 to stop the wrong size getting
used in the future. This change only affects postop_attr
attributes; the nfsm_loadattr() macro works as normal.

The change is implemented by adding a new argument to
nfs_loadattrcache() called 'dontshrink'. When this is
non-zero, nfs_loadattrcache() will never reduce the
vnode/nfsnode size; instead it zeros n_attrstamp.

There remain other was processes can get stuck in vmopar.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Reviewed by: dillon
Tested by: Vadim Belman <voland@lflat.org>


67152 15-Oct-2000 bp

Make nfs PDIRUNLOCK aware. Now it is possible to use nullfs mounts on top
of nfs mounts, but there can be side effects because nfs uses shared locks
for vnodes.


67151 15-Oct-2000 bp

Add missed vop_stdunlock() for fifo's vnops (this affects only v2 mounts).

Give nfs's node lock its own name.


66615 04-Oct-2000 jasone

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


66355 25-Sep-2000 bp

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


65497 05-Sep-2000 msmith

Don't scan for the "right" network interface by shooting in the dark.
Assume that the nfs_diskless structure is correctly set up; the provider
ought to be getting it right.


63788 24-Jul-2000 mckusick

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


61619 13-Jun-2000 ps

Correctly set the Maximum DHCP Message Size. bootpd now works
again as well as ISC dhcpd.


60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


60141 07-May-2000 phk

Include a RFC 1533 "Maximum DHCP Message Size" option in our request.

ISC DHCP will limit the reply length to 64 bytes for bootp replies
unless we explicitly tell it we can do more. We tell it that we can do
1200 bytes.


60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


59794 30-Apr-2000 phk

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


59762 29-Apr-2000 phk

s/biowait/bufwait/g

Prodded by: several.


59391 19-Apr-2000 phk

Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>


59249 15-Apr-2000 phk

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


58710 27-Mar-2000 dillon

Add a sysctl to specify the amount of UDP receive space NFS should
reserve, in maximal NFS packets. Originally only 2 packets worth of
space was reserved. The default is now 4, which appears to greatly
improve performance for slow to mid-speed machines on gigabit networks.

Add documentation and correct some prior documentation.

Problem Researched by: Andrew Gallatin <gallatin@cs.duke.edu>
Approved by: jkh


58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


57178 13-Feb-2000 peter

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


55934 13-Jan-2000 dillon

The alpha build cuases the 'nfsuid bloated' warning to occur. Well,
there is nothing we can do about it. In fact, after further review
there simply are not very many instances of the two structures NFS
checks for 'bloat' so I've decided to simply rip the checks out entirely.

Submitted by: Andrew Gallatin <gallatin@cs.duke.edu>


55679 09-Jan-2000 shin

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


55431 05-Jan-2000 dillon

Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
the list in order to maintain a 'forward' locality of reference, which
is arguably better then 'reverse'. The original algorithm did things this
way to but at a huge time cost.

Enhance the append interlock for NFS writes to handle intr/soft mounts
better.

Fix the hysteresis for NFS async daemon I/O requests to reduce the
number of unnecessary context switches.

Modify handling of NFS mount options. Any given user option that is
too high now defaults to the kernel maximum for that option rather then
the kernel default for that option.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


55423 05-Jan-2000 dillon

Fix at least one source of the continued 'NFS append race'. close()
was calling nfs_flush() and then clearing the NMODIFIED bit. This is
not legal since there might still be dirty buffers after the nfs_flush
(for example, pending commits). The clearing of this bit in turn prevented
a necessary vinvalbuf() from occuring leaving left over dirty buffers
even after truncating the file in a new operation. The fix is to
simply not clear NMODIFIED.

Also added a sysctl vfs.nfs.nfsv3_commit_on_close which, if set to 1,
will cause close() to do a stage 1 write AND a stage 2 commit
synchronously. By default only the stage 1 write is done synchronously.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


55206 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


54970 21-Dec-1999 alfred

make getfh a standard syscall instead of dependant on having
NFSSERVER defined, useful for userland fileservers that want to
use a filehandle type interface to the filesystem.

Submitted by: Assar Westerlund assar@stacken.kth.se
PR: kern/15452


54803 19-Dec-1999 rwatson

Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by: eivind


54799 19-Dec-1999 green

M_PREPEND-related cleanups (unregisterifying struct mbuf *s).


54655 15-Dec-1999 eivind

Introduce NDFREE (and remove VOP_ABORTOP)


54605 14-Dec-1999 dillon

Fix two problems: First, fix the append seek position race that can
occur due to np->n_size potentially changing if nfs_getcacheblk()
blocks in nfs_write().

Second, under -current we must supply the proper bufsize when obtaining
buffers that straddle the EOF, but due to the fact that np->n_size can
change out from under us it is possible that we may specify the wrong
buffer size and wind up truncating dirty data written by another
process.

Both problems are solved by implementing nfs_rslock(), which allows us
to lock around sensitive buffer cache operations such as those that
occur when appending to a file.

It is believed that this race is responsible for causing dirtyoff/dirtyend
and (in stable) validoff/validend to exceed the buffer size. Therefore
we have now added a warning printf for the dirtyoff/end case in current.

However, we have introduced a new problem which we need to fix at some
point, and that is that soft or intr NFS mounts may become
uninterruptable from the point of view of process A which is stuck waiting
on rslock while process B is stuck doing the rpc. To unstick process A,
process B would have to be interrupted first.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


54536 13-Dec-1999 dillon

Fix a timeout deadlock that can occur when the process holding the
receive lock hasn't yet managed to send its own request.

PR: kern/15055
Submitted by: Ian Dowse iedowse@maths.tcd.ie


54485 12-Dec-1999 dillon

Fix a number of server-side issues related to aborting badly formed
NFS packets, mainly initializing structure pointers to NULL which
are conditionally freed prior to return.

PR: kern/15249
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


54480 12-Dec-1999 dillon

Synopsis of problem being fixed: Dan Nelson originally reported that
blocks of zeros could wind up in a file written to over NFS by a client.
The problem only occurs a few times per several gigabytes of data. This
problem turned out to be bug #3 below.

bug #1:

B_CLUSTEROK must be cleared when an NFS buffer is reverted from
stage 2 (ready for commit rpc) to stage 1 (ready for write).
Reversions can occur when a dirty NFS buffer is redirtied with new
data.

Otherwise the VFS/BIO system may end up thinking that a stage 1
NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.

bug #2:

B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
buffers only occur near the EOF of the file). Change to only set
when the buffer is a full biosize (usually 8K). This bug has no
effect but should be fixed in -current anyway. It need not be
backported.

bug #3:

B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
typically only called by the update daemon). nfs_flush()
does a multi-pass loop but due to the lack of vnode locking it
is possible for new buffers to be added to the dirtyblkhd list
while a flush operation is going on. This may result in nfs_flush()
setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
stage 1 write, causing only the commit rpc to be made and thus
causing the contents of the buffer to be thrown away (never sent to
the server).

The patch also contains some cleanup, which only applies to the commit
into -current.

Reviewed by: dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>


54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


53937 30-Nov-1999 dillon

The symlink implementation could improperly return a NULL vp along with
a 0 error code. The problem occured with NFSv2 mounts and also with
any NFSv3 mount returning an EEXIST error (which is translated to 0
prior to return). The reply to the rpc only contains the file handle
for the no-error case under NFSv3. The error case under NFSv3 and
all cases under NFSv2 do *not* return the file handle. The fix is
to do a secondary lookup to obtain the file handle and thus be able
to generate a return vnode for the situations where the rpc reply
does not contain the required information.

The bug was originally introduced when VOP_SYMLINK semantics were
changed for -CURRENT. The NFS symlink implementation was not properly
modified to go along with the change despite the fact that three
people reviewed the code. It took four attempts to get the current
fix correct with five people. Is NFS obfuscated? Ha!

Reviewed by: Alfred Perlstein <bright@wintelcom.net>
Testing and Discussion: "Viren R.Shah" <viren@rstcorp.com>, Eivind Eklund <eivind@FreeBSD.ORG>, Ian Dowse <iedowse@maths.tcd.ie>


53776 27-Nov-1999 eivind

Remap the error EEXISTS => 0 *before* using error to determine if we should
return a vp.


53552 22-Nov-1999 dillon

nm_srtt and nm_sdrtt are arrays[4]. Remove explicit initialization
of element [4] in both, which goes beyond the end of the array, leaving
[0], [1], [2], and [3]. This bug did not cause any problems since
the overrun fields are initialized after the bogus array init but
needs to be fixed anyway.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


53463 20-Nov-1999 eivind

Fix VOP_MKNOD for loss of WILLRELE. I don't know how I could have missed
this in the first place :-(

Noticed by: bde


53131 13-Nov-1999 eivind

Remove WILLRELE from VOP_SYMLINK

Note: Previous commit to these files (except coda_vnops and devfs_vnops)
that claimed to remove WILLRELE from VOP_RENAME actually removed it from
VOP_MKNOD.


53095 11-Nov-1999 dillon

Remove special case socket sharing code in order to allow nfsd to
bind IP addresses to udp/cltp sockets separately.

PR: kern/13049
Reviewed by: David Malone <dwmalone@maths.tcd.ie>, freebsd-current


53022 08-Nov-1999 dillon

Fix nfssvc_addsock() to not attempt to free a NULL socket structure
when returning an error. Bug fix was extracted from the PR. The PR
is not yet entirely resolved by this commit.

PR: kern/13049
Reviewed by: Matt Dillon <dillon@freebsd.org>
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


52781 01-Nov-1999 msmith

Call bootpc_init before we try to mount an NFS root, if we're configured
to use BOOTP for NFS root discovery.

The entire interface setup inside nfs_mountroot is evil, and should die.


52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


52492 25-Oct-1999 dillon

Move NFS access cache hits/misses into nfsstats structure so
/usr/bin/nfsstat can get to it easily.


51906 03-Oct-1999 phk

Before we start to mess with the VFS name-cache clean things up a little bit:
Isolate the namecache in its own file, and give it a dedicated malloc type.


51799 29-Sep-1999 marcel

Careless use of struct proc *p caused major problems. 'p' is allowed to
be NULL in this function (nfs_sigintr). Reorder the statements and guard
them all with a single if (p != NULL).

reported, reviewed and tested by: jdp


51791 29-Sep-1999 marcel

sigset_t change (part 2 of 5)
-----------------------------

The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.

The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.

struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.

signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.

Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.

NOTE: kdump (and port linux_kdump) must be recompiled.

Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.


51475 20-Sep-1999 dillon

Add comment to clarify a commit rpc optimization already being performed.


51344 17-Sep-1999 dillon

Asynchronized client-side nfs_commit. NFS commit operations were
previously issued synchronously even if async daemons (nfsiod's) were
available. The commit has been moved from the strategy code to the doio
code in order to asynchronize it.

Removed use of lastr in preparation for removal of vnode->v_lastr. It
has been replaced with seqcount, which is already supported by the system
and, in fact, gives us a better heuristic for sequential detection then
lastr ever did.

Made major performance improvements to the server side commit. The
server previously fsync'd the entire file for each commit rpc. The
server now bawrite()s only those buffers related to the offset/size
specified in the commit rpc.

Note that we do not commit the meta-data yet. This works still needs
to be done.

Note that a further optimization can be done (and has not yet been done)
on the client: we can merge multiple potential commit rpc's into a
single rpc with a greater file offset/size range and greatly reduce
rpc traffic.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


51138 11-Sep-1999 alfred

Seperate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|stafs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from: NetBSD


51068 07-Sep-1999 alfred

All unimplemented VFS ops now have entries in kern/vfs_default.c that return
reasonable defaults.

This avoids confusing and ugly casting to eopnotsupp or making dummy functions.
Bogus casting of filesystem sysctls to eopnotsupp() have been removed.

This should make *_vfsops.c more readable and reduce bloat.

Reviewed by: msmith, eivind
Approved by: phk
Tested by: Jeroen Ruigrok/Asmodai <asmodai@wxs.nl>


50521 28-Aug-1999 phk

remove unused variables.


50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50405 26-Aug-1999 phk

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


50053 19-Aug-1999 peter

Convert all the nfs macros to do { blah } while (0) to ensure it
works correctly in if/else etc. egcs had probably picked up most of the
problems here before with "ambiguous braces" etc, but this should
increase the robustness a bit. Based on an idea from Eivind Eklund.


49945 17-Aug-1999 alc

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


49659 12-Aug-1999 dt

nfs_getcacheblk() can return 0 if the mount is interruptible. It need to be
checked by the caller.

Broken in: rev. 1.70 (1999/05/02)


49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


49405 04-Aug-1999 peter

Don't over-allocate and over-copy shorter NFSv2 filehandles and then
correct the pointers afterwards.

It's kinda bogus that we generate a 24 (?) byte filehandle (2 x int32
fsid and 16 byte VFS fhandle) and pad it out to 64 bytes for NFSv3 with
garbage. The whole point of NFSv3's variable filehandle length was
to allow for shorter handles, both in memory and over the wire. I plan
on taking a shot at fixing this shortly.


49302 31-Jul-1999 msmith

As described by the submitter:

I did some tcpdumping the other day and noticed that GETATTR calls
were frequently followed by an ACCESS call to the same file. The
attached patch changes nfs_getattr to fill the access cache as a side
effect. This is accomplished by calling ACCESS rather than
GETATTR. This implies a modest overhead of 4 bytes in the request and
8 bytes in the response compared to doing a vanilla GETATTR.
...
[The patch comprises two parts] The first
is the "real" patch, the second counts misses and hits rather than
fills and hits. The difference is subtle but important because both
nfs_getattr and nfs_access now fill the cache. It also changes the
default value of nfsaccess_cache_timeout to better match the attribute
cache. IMHO, file timestamps change much more frequently than
protection bits.

Submitted by: Bjoern Groenvall <bg@sics.se>
Reviewed by: dillon (partially)


49239 30-Jul-1999 wpaul

Close PR #12651: the hash calculation routine has changed in other
parts of the kernel but was not updated in nfs_readdirplusrpc().


49237 30-Jul-1999 wpaul

Fix two bugs in nfs_readdirplus(). The first is that in some cases,
vnodes are locked and never unlocked, which leads to processes starting
to wedge up after doing a mount -o nfsv3,tcp,rdirplus foo:/fs /fs; ls /fs.
The second is that sometimes cnp is accessed without having been
properly initialized: cnp->cn_nameptr points to an earlier name while
"len" contains the length of a current name of different size. This
leads to an attempt to dereference *(cn->cn_nameptr + len) which will
sometimes cause a page fault and a panic.

With these two fixes, client side readdirplus works correctly with
FreeBSD, IRIX 6.5.4 and Solaris 2.5.1 and 2.6 servers.

Submitted by: Matthew Dillon <dillon@backplane.com>


48859 17-Jul-1999 phk

I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name. s/umakedev/makeudev/g


48394 01-Jul-1999 peter

Fix warning. va_fsid is udev_t, which is int32_t. No need to use %lx.


48357 30-Jun-1999 julian

Submitted by: Conrad Minshall <conrad@apple.com>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>

The following ugly hack to the exit path of nfs_readlinkrpc() circumvents
an Auspex bug: for symlinks longer than 112 (0x70) they return a 1024 byte
xdr string - the correct data with many nulls appended. Without this fix
namei returns ENAMETOOLONG, at least it does on our source base and on
FreeBSD 3.0. Note we do not (and should not) rely upon their null padding.


48317 28-Jun-1999 peter

Fix a KASSERT() that was negated and lead to:
nfs_strategy: buffer 0xxxxx not locked
when you attempted to write and had INVARIANTS turned on.


48274 27-Jun-1999 peter

Minor tweaks to make sure (new) prerequisites for <sys/buf.h> (mostly
splbio()/splx()) are #included in time.


48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


48125 23-Jun-1999 julian

Matt's NFS fixes.
Submitted by: Matt Dillon
Reviewed by: David Cross, Julian Elischer, Mike Smith, Drew Gallatin
3.2 version to follow when tested


48024 19-Jun-1999 mjacob

Thanks to Bruce for noticing this.... compare against the *new* nfsnode's
mount point for seeing whether or not the new nfsnode is already in the
hash queue. We're pretty much guaranteed that the old nfsnode is already
in the hash queue. Wank! Infinite Loop! Looks like just a minor typo....
(ah the influence of fortran ... np && np2... why not nfsnode_the_first &&
nfsnode_the_second???)...


47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


47938 15-Jun-1999 mjacob

If we retry this operation from the top of this routine, we need to
make sure we've freed any allocated resources (to avoid a memory leak)
and and do the right thing with respect to the nfs node hash lock we'd
acquired.


47751 05-Jun-1999 peter

Various changes lifted from the OpenBSD cvs tree:

txdr_hyper and fxdr_hyper tweaks to avoid excessive CPU order knowledge.

nfs_serv.c: don't call nfsm_adj() with negative values, windows clients
could crash servers when doing a readdir of a large directory.

nfs_socket.c: Use IP_PORTRANGE to get a priviliged port without a spin
loop trying to bind(). Don't clobber a mbuf pointer or we get panics
on a NFS3ERR_JUKEBOX error from a server when reusing a freed mbuf.

nfs_subs.c: Don't loose st_blocks on NFSv2 mounts when > 2GB.

Obtained from: OpenBSD


47750 05-Jun-1999 peter

Fix a malloc race

Obtained from: OpenBSD (csapuntz)


47749 05-Jun-1999 peter

Don't mistake a non-async block that needs to be committed for an
interrupted write.

Obtained from: fvdl@NetBSD.org via OpenBSD.


47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


46580 06-May-1999 phk

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


46568 06-May-1999 peter

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


46370 03-May-1999 alc

All directory accesses must be made with NFS_DIRBLKSIZE chunks to avoid
confusing the directory read cookie cache. The nfs_access implementation
for v2 mounts attempts to read from the directory if root is the user
so that root can't access cached files when the server remaps root
to some other user.

Submitted by: Doug Rabson <dfr@nlsystems.com>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>


46349 02-May-1999 alc

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


45996 24-Apr-1999 dt

Fixed printf format errors on alpha.


45553 10-Apr-1999 peter

Close a potential mbuf and/or mbuf cluster leak in the client-side NFS
statfs() code. Free the whole chain, not just the first one.


45361 06-Apr-1999 peter

Hold nfsd's upages in-core with PHOLD rather than P_NOSWAP.


45347 05-Apr-1999 julian

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


44679 12-Mar-1999 julian

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


44246 25-Feb-1999 peter

Untangle the nfs send and receive queue locking a little. One lock
routine was [ab]used for two different things, and you couldn't tell from
the wait channel which one had wedged.
Catch a few things missing from NFS_NOSERVER.


44112 18-Feb-1999 dfr

Move the declaration of the vfs.nfs sysctl node outside an ifdef so that
it builds if NFS_NOSERVER is defined.

Spotted by: Bruce Evans <bde@zeta.org.au>


44101 17-Feb-1999 bde

Fixed bitrot in NFS_ACDEBUG option.


44078 16-Feb-1999 dfr

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


43960 13-Feb-1999 dillon

General additional cleanup of VOP API for NFS ops - mainly NFS ignoring
the API for freeing up cnp's. This cleanup should not effect nominal
operation one way or the other since NFS VOPs just happen to be called
with flags that match what it actually does to the NAMEI components it
gets. Still, if an NFS error occured, there was probably some memory
leakage of NAMEI components with certain NFS VOP ops.


43956 13-Feb-1999 dillon

PR: kern/9970

Remove incorrect vput() in nfs_link()


43705 06-Feb-1999 dillon

Flush delayed-write data out prior to issuing a rename rpc. This appears
to fix the problem w/ NFSV3 whereby a make installworld would get into
high-network-bandwidth situations continuously trying to retry nfs writes
that fail with a 'stale file handle' error.


43351 28-Jan-1999 dillon

Fix warnings related to -Wall -Wcast-qual


43311 28-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


43309 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile.

This commit includes significant work to proper handle const arguments
for the DDB symbol routines.


43307 27-Jan-1999 dillon

Fix nasty bug in nfs_access(). A conditional was if (a = b) instead of
if (a == b).


43306 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


43305 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


42582 12-Jan-1999 eivind

Remove two cases of unused variable sp3.


42315 05-Jan-1999 eivind

Remove the 'waslocked' parameter to vfs_object_create().


42155 30-Dec-1998 hoek

Silence -Wtrigraph.

Submitted by: Bradley Dunn <bradley@dunn.org> (pr: kern/8817)


42060 25-Dec-1998 dfr

Fix for creating files on a Solaris 7 server with NFSv3 (the request was
slightly garbled but older servers seemed to understand it).

Reviewed by: David O'Brien <obrien@nuxi.ucdavis.edu>


41796 14-Dec-1998 dt

Added 3 new errno values, requred by various standards: EOVERFLOW,
ECANCELED, EILSEQ.

Fixed ibcs2 and especially linux EIDRM and ENOMSG errno mapping.
Reviewed by: Dan Nelson <dnelson@emsphone.com>


41791 14-Dec-1998 dt

(Hopefully) fix support for "large" files. Mostly cast block numbers to off_t
before they multiplied to block sizes.


41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


41514 04-Dec-1998 archie

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>


41488 03-Dec-1998 dillon

Make bootp error message slightly more verbose


41186 15-Nov-1998 msmith

Reimplement the NFS ACCESS RPC cache as an "accelerator" rather than a true
cache. If the cached result lets us say "yes", then go with that. If
we're not sure, or we think the answer might be "no", go to the wire to be
certain. This avoids all of the possible false veto cases, and allows us
to key the cached value with just the UID for which the cached value holds,
reducing the bloat of the nfsnode structure from 104 bytes to just 12 bytes.

Since the "yes" case is by far the most common, this should still provide
a substantial performance improvement. Also default the cache to on, with
a conservative timeout (2 seconds). This improves performance if NFS is
loaded as a KLD module, as there's not (yet) code to parse an option out
of the module arguments to set it, and sysctl doesn't work (yet) for OIDs
in modules.

The 'accelerator' mode was suggested by Bjoern Groenvall (bg@sics.se)

Feedback on this would be appreciated as testing has been necessarily
limited by Comdex, and it would be valuable to have this in 2.2.8.


41138 13-Nov-1998 msmith

Avoid a null pointer reference if the target of an NFS rename has been
sillrenamed, or if the source vnode doesn't have an associated nfsnode.

Bug report from Andrew Gallatin <gallatin@cs.duke.edu>


41132 13-Nov-1998 dfr

Fix a panic in nfsrv_dorec() where a NULL pointer could be passed to
free() sometimes.

Reviewed by: Eric Haug <ejh@eas.slu.edu>


41127 13-Nov-1998 msmith

Implement NFS ACCESS RPC result caching.

This yields startling performance increases for NFS clients for many
access profiles, due to the fact that ACCESS results are persistently
cached in the namecache in many cases.

Note that the code is somewhat conservative in that it requires an
exact credential match for a cache hit. This bloats the nfsnode
structure by sizeof(struct ucred) (96 bytes). Any less conservative
approach opens the possibility for a false veto in eg. setuid
applications. Alternative suggestions would be welcomed.

The cache is normally disabled, to activate set the sysctl variable
vfs.nfs.access_cache_timeout to a nonzero value. This is the time in
seconds that a cached entry will be considered valid; useful values appear
to be 2-10 seconds. Performance of the cache can be monitored with the
vfs.nfs.access_cache_hits and vfs.nfs.access_cache_hits variables.


41026 09-Nov-1998 peter

Remove [apparently] bogus casts to u_long for the vnode_pager_setsize()
second argument. np_size is a 64 bit int, so is the second arg. This
might have caused needless 2G/4G file size problems.

I believe it was Bruce who queried this.


40790 31-Oct-1998 peter

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


39794 29-Sep-1998 mckusick

In nfs_link(), check for a cross-device mount *before* looking
in the v_data field.
Obtained from: Charles Hannum, via Frank van der Linden <frank@wins.uva.nl>


39793 29-Sep-1998 mckusick

Missing vput when cross-device link error is detected in nfs_link.


39792 29-Sep-1998 mckusick

During truncation, have to notify the VM about the new size
of the NFS file *before* doing the nfs_vinvalbuf operation.
Otherwise some invalid data may show up in an mmap.


39790 29-Sep-1998 mckusick

Frank sez: 'It fixes a problem with servers that return 0 values
for some of the fsinfo RPC fields. It is strictly speaking not
wrong to do this, as the spec says that "it is expected that a
server will make a best effort at supporting all the attributes",
but pretty unusual. You guessed it, it's NT servers that do it.'
Obtained from: Frank van der Linden <frank@wins.uva.nl>


39789 29-Sep-1998 mckusick

Do not need (or want) to take a reference on an NFS file that
is being deleted due to an forcible unmount. The problem is
that vgone calls vclean() which then calls calls nfs_inactive()
with VXLOCK set on the vnode. Nfs_inactive() was calling vget()
to get a reference on the vnode, which in turn hung on VXLOCK.
Nfs_inactive() now checks v_usecount to make sure that the vnode
is not coming from vclean() before it does a vget().


39788 29-Sep-1998 mckusick

The code checks each fragment mark to see if it's valid; if the fragment
is less than NFS_MINPACKET or greater than NFS_MAXPACKET in size, it
barfs and, I think, drops the connection.

However, there's no guarantee that in a multi-fragment RPC, all the
fragments will be at least as large as NFS_MINPACKET.

In fact, with the version of "tclnfs" we have here, which supports NFS
over TCP, at least when built under SunOS 4.1.3 (i.e., with 4.1.3's
user-mode ONC RPC library), I can *repeatably* cause "tclnfs" to send a
request with more than one fragment, one of which is only 8 bytes long.
I just do a 3877-byte write to a file, at an offset of 0.

The check that "slp->ns_reclen" is greater than or equal to
NFS_MINPACKET serves no useful purpose - if the NFS server code can't
handle packets < NFS_MINPACKET bytes, it can't handle them over *any*
protocol, so the check has to be done above the RPC-over-TCP layer - and
should be removed.
Obtained from: Fix from Guy Harris, forwarded by Rick Macklem.


39782 29-Sep-1998 mckusick

Mark directory buffers that have no valid data with B_INVAL
so that they are not put in the cache.


39781 29-Sep-1998 mckusick

When adding data to a buffer, we need to clear the B_NEEDCOMMIT flag
which says that the data is on server but not committed.


38909 07-Sep-1998 bde

Removed statically configured mount type numbers (MOUNT_*) and all
references to them.

The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.


38894 07-Sep-1998 bde

Made unloading of the nfs LKM sort of work. This is mainly to test
detachment of vfs sysctls. Unloading of vfs LKMs doesn't actually
work for any vfs, since it leaves garbage pointers to memory
allocation control structures.


38869 05-Sep-1998 bde

Ignore the statically configured vfs type numbers and assign vfs
type numbers in vfs attach order (modulo incomplete reuse of old
numbers after vfs LKMs are unloaded). This requires reinitializing
the sysctl tree (or at least the vfs subtree) for vfs's that support
sysctls (currently only nfs). sysctl_order() already handled
reinitialization reasonably except it checked for annulled self
references in the wrong place.

Fixed sysctls for vfs LKMs.


38866 05-Sep-1998 bde

Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.


38799 04-Sep-1998 dfr

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


38718 01-Sep-1998 luoqi

Check for NULL pointer before freeing a struct sockaddr. m_freem() can handle
NULL, buf free() can't.


38482 23-Aug-1998 wollman

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


38412 18-Aug-1998 bde

Fixed printf format errors.


38299 13-Aug-1998 dfr

Protect all modifications to v_numoutput with splbio().


38289 12-Aug-1998 bde

Don't configure compatibility code for pre-Lite2 mount() calls by
default. This code should go away soon.


37997 01-Aug-1998 peter

If we get an ENOBUFS from the network, it's normally transient network
interface congestion (eg: nfs over a ppp link, etc). Don't log these
for UDP mounts, and don't cause syscalls to fail with EINTR.
This stops the 'nfs send error 55' warnings.

If the error is because the system is really hosed, this is the least
of your problems...


37649 15-Jul-1998 bde

Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.


37384 04-Jul-1998 julian

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


37291 30-Jun-1998 jmg

fix buildworld hopefully be3fore anyone complains...

NFS_*TIMO should possibly be converted to sysctl vars (jkh's suggestion),
but in some cases it looks like nfs keeps a copy of the value in a struct

hash sizes are already ifdef'd KERNEL, so there aren't userland inpact
from them...


37272 30-Jun-1998 jmg

convert some nfs tunables to options, these are:
NFS_MINATTRTIMO VREG attrib cache timeout in sec
NFS_MAXATTRTIMO
NFS_MINDIRATTRTIMO VDIR attrib cache timeout in sec
NFS_MAXDIRATTRTIMO
NFS_GATHERDELAY Default write gather delay (msec)
NFS_UIDHASHSIZ Tune the size of nfssvc_sock with this
NFS_WDELAYHASHSIZ and with this
NFS_MUIDHASHSIZ Tune the size of nfsmount with this
NFS_NOSERVER (already documented in LINT)
NFS_DEBUG turn on NFS debugging

also, because NFS_ROOT is used by very different files, it has been
renamed to opt_nfsroot.h instead of the old opt_nfs.h....


37089 21-Jun-1998 bde

Fixed typo in ifdefed code. (NFS_ACDEBUG is not in LINT. Therefore,
code controlled by it did not even compile.)


36979 14-Jun-1998 bde

Avoid an egcs pessimization for 64-bit signed division on i386's.
Pre-2.8 versions of gcc generate a call to __divdi3() for all 64-bit
signed divisions, but egcs optimizes them to a shift and fixup when
the divisor is a constant power of 2. Unfortunately, it generates
a call to __cmpdi2() for the fixup, although all except possibly
ancient versions of gcc and egcs do ordinary 64-bit comparisons
inline.


36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


36563 01-Jun-1998 peter

Make sure we go a nfs_fsinfo() in get/putpages before calling
readrpc/writerpc, since they assume it's already been done. This could
break if the first read/write access to a nfs filesystem was an exec() or
mmap() instead of a read(), write() syscall. (or statfs()).
nfs_getpages() could return an errno (EOPNOTSUPP) instead of a VM_PAGER_*
return code. Some layout tweaks for the get/putpages code.


36562 01-Jun-1998 peter

Fix post-test pre-commit cleanup typo.


36561 01-Jun-1998 peter

readlink() returns EINVAL rather than EPERM if called on a non-symlink.


36560 01-Jun-1998 peter

Preset the maximum file size before we get to nfs_fsinfo(), based on
an (over?) conservative assumption about what the client can store in it's
buffer cache using a signed 32-bit 512-byte block number index. Otherwise
it's possible for some file access when maxfilesize = 0 (eg: /usr is nfs
mounted and doing an execve())
Pointed out by: bde

XXX It might make sense to do a preemptive nfs_fsinfo() call at mount time.


36541 31-May-1998 peter

For the on-the-wire protocol, u_long -> u_int32_t; long -> int32_t;
int -> int32_t; u_short -> u_int16_t. Also, use mode_t instead of u_short
for storing modes (mode_t is a u_int16_t).

Obtained from: NetBSD


36540 31-May-1998 peter

Support 'mount -u' remounts. This may require disconnecting and rebinding
the socket. Certain mode changes are not allowed.

Obtained from: NetBSD


36538 31-May-1998 peter

xdr encode -1 properly.

Obtained from: NetBSD


36537 31-May-1998 peter

Fully fill in nfsv2 write rpc requests rather than leaving garbage.

Obtained from: NetBSD


36536 31-May-1998 peter

Don't silently fail to set file flags.

Obtained from: NetBSD


36535 31-May-1998 peter

Don't blindly accept the server's preferences if they are too small.

Obtained from: NetBSD


36534 31-May-1998 peter

Prototype support for selectively allowing non-reserved ports on a per
export basis. Needs userland support yet.

Obtained from: NetBSD


36531 31-May-1998 peter

Don't pass a second copy of the uid/gid in with the v2/v3 sattr structures,
it just makes more work. We pass a copy of the uid/gid with the
credentials. (although, this may need to be revisited if a non AUTHUNIX
authentication method (such as NFSKERB) ever gets implemented).

Obtained from: NetBSD


36530 31-May-1998 peter

Use the new SB_UPCALL flag,

Obtained from: NetBSD (but I changed the flag clear order in case).


36526 31-May-1998 peter

NFS_SMALLFH is defined in nfsproto.h, not sys/mount.h

Obtained from: NetBSD


36525 31-May-1998 peter

Don't let the user try "rmdir ."

Obtained from: NetBSD


36524 31-May-1998 peter

Don't let the user try and unlink() a directory on a NFS server.

Obtained from: NetBSD


36523 31-May-1998 peter

When a write rpc returns an error, break the loop.

Obtained from: NetBSD


36522 31-May-1998 peter

Don't leak an mbuf when a write rpc returns zero bytes written.

Obtained from: NetBSD


36521 31-May-1998 peter

#ifdef a diagnostic printf

Obtained from: NetBSD


36520 31-May-1998 peter

Don't try and free mrep twice on some error conditions.

Obtained from: NetBSD


36519 31-May-1998 peter

#ifdef a diagnostic panic, plus another missed costmetic change.

Obtained from: NetBSD


36518 31-May-1998 peter

We have gained 2 more errno's, add them to the NFSv2 mapping table.


36517 31-May-1998 peter

Missed a cosmetic change that the other BSD's have.


36516 31-May-1998 peter

oops, nfs_msg() is called from client code too.


36515 31-May-1998 peter

When we can't reconnect a socket, don't forget to unlock before retrying
or we can deadlock.

Obtained from: NetBSD


36514 31-May-1998 peter

Don't log zero length reads, this can happen during normal operation.

Obtained from: NetBSD


36513 31-May-1998 peter

Consider for readdir chunk sizes when tuning socket buffer reservations.

Obtained from: NetBSD


36511 31-May-1998 peter

Some const's

Obtained from: NetBSD


36503 31-May-1998 peter

NFS Jumbo commit part 1. Cosmetic and structural changes only. The aim
of this part of commits is to minimize unnecessary differences between
the other NFS's of similar origin. Yes, there are gratuitous changes here
that the style folks won't like, but it makes the catch-up less difficult.


36481 31-May-1998 peter

VOP_ABORTUP() appears to be called with the wrong vnode. The other callers
that I checked (eg: ufs_link()) do the ABORTOP on the directory rather than
the file itself. After Michael Hancock's patches, the abortop doesn't seem
all that critial now since something else will free the pathname buffer.


36473 30-May-1998 peter

When using NFSv3, use the remote server's idea of the maximum file size
rather than assuming 2^64. It may not like files that big. :-)
On the nfs server, calculate and report the max file size as the point
that the block numbers in the cache would turn negative.
(ie: 1099511627775 bytes (1TB)).

One of the things I'm worried about however, is that directory offsets
are really cookies on a NFSv3 server and can be rather large, especially
when/if the server generates the opaque directory cookies by using a local
filesystem offset in what comes out as the upper 32 bits of the 64 bit
cookie. (a server is free to do this, it could save byte swapping
depending on the native 64 bit byte order)

Obtained from: NetBSD


36329 24-May-1998 peter

Convert a couple of large allocations to use zones rather than malloc
for better packing. This means that we can choose better values for the
various hash entries without having to try and get it all to fit within
an artificial power of two limit for malloc's sake.


36249 20-May-1998 peter

s/flags/flag/


36248 20-May-1998 peter

A cleaner fix for PR#5102, clear nonsense flags at mount time rather than
in the core of nfs_bio.c at the 11th hour.

PR: 5102


36247 20-May-1998 peter

Don't change argp->flags after it's been copied.


36176 19-May-1998 peter

Allow control of the attribute cache timeouts at mount time.

We had run out of bits in the nfs mount flags, I have moved the internal
state flags into a seperate variable. These are no longer visible via
statfs(), but I don't know of anything that looks at them.


36100 16-May-1998 bde

Get timespecs directly instead of via timevals.


36099 16-May-1998 bde

Don't abuse `+' to combine flags.


36098 16-May-1998 bde

Backed out rev.1.76. It just added style bugs.


36097 16-May-1998 bde

Get timespecs directly instead of via timevals.


36013 13-May-1998 peter

Add missing arg to vget().. Serves me right for committing a 2.2 patch to
-current without testing it there.. :-(

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35994 13-May-1998 peter

Hold a reference to the vnode during the sillyrename cleanup. If we block
in nfs_vinvalbuf() or the nfs_removeit(), we can have the nfsnode reallocated
from underneath us (eg: replaced by a ufs 'struct inode') which can cause
disk corruption ('freeing free block' when di_db[5] gets trashed).
This is not a cheap fix, but it'll do until the nfsnodes get reference
counting and/or locking.

Apparently NetBSD have a similar fix (apparently from BSDI).

I wish all PR's had this much useful detail. :-)

PR: 6611
Submitted by: Stephen Clawson <sclawson@marker.cs.utah.edu>


35991 13-May-1998 peter

Move the *vpp initialization earlier so that it's set in all error cases.
This should stop the 'panic: leaf should not be empty' nfs panic.

PR: 1856
Submitted by: msaitoh@spa.is.uec.ac.jp


35823 07-May-1998 msmith

In the words of the submitter:

---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.

Quality testing was done with testvn, and lat_fs from the lmbench suite.

Some NFS client testing courtesy of Patrik Kudo.

vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35769 06-May-1998 msmith

As described by the submitter:

Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.

The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35066 06-Apr-1998 phk

Use random() to find our initial xid.


34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


34930 28-Mar-1998 steve

Don't allow the readdirplus routine to be used in NFS V2.

PR: 5102
Reviewed by: msmith
Submitted by: Dmitry Kohmanyuk <dk@farm.org>


34926 28-Mar-1998 bde

Don't depend on <sys/mount.h> including <sys/socket.h>.


34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


34573 14-Mar-1998 tegge

Add a BOOTP_WIRED_TO option, for use on machines with multiple network
cards where the first detected card should not be used for bootp.
Submitted by: Doug Ambrisko <ambrisko@whistle.com>


34572 14-Mar-1998 tegge

Update workaround for limitations in the arp code.
Adjust the RPC timeout message which occured when the old workaround
broke to show the correct IP address.


34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


34096 06-Mar-1998 msmith

Trivial filesystem getpages/putpages implementations, set the second.
These should be considered the first steps in a work-in-progress.
Submitted by: Terry Lambert <terry@freebsd.org>


33964 01-Mar-1998 msmith

The intent is to get rid of WILLRELE in vnode_if.src by making
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:

vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink

This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.

A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.

VFS_VRELE implementations were made for the following file systems:

Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs

Custom
union umapfs

Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660

These implementations may change as VOP changes are implemented.

In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.

This will only be done for vnode arguments that are released by the various
fs vop implementations.

Wider use of VFS_VRELE will likely require restructuring of the code.

Reviewed by: phk, dyson, terry et. al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>


33181 09-Feb-1998 eivind

Staticize.


33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


33121 05-Feb-1998 dyson

Fix an omission of a line from the previous commit to this file. The
problem appeared to be an NFS hang.


33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


33054 03-Feb-1998 bde

Forward declare some structs so that this file is more self-sufficient.


32998 01-Feb-1998 bde

Moved declaration of `union nethostadr' outside of the KERNEL section,
to give pollution compatible with <nfs/nqfs.h>. At least mount_nfs.c
previously had to #define KERNEL before including <nfs/nfs.h> to get
this pollution, but this gave other pollution.

Moved comment about NFSINT_SIGMASK to immediately before the code that
it applies to.


32997 01-Feb-1998 bde

Forward declare more structs that are used in prototypes here - don't
depend on <sys/types.h> forward declaring common ones.

Added an underscore to `sin' in prototypes to avoid warnings for the
conflict with the ANSI sin().


32912 31-Jan-1998 tegge

Release the buffer when an error occurs while reading directory entries.


32755 25-Jan-1998 dyson

Various NFS fixes:
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.

(Thanks to Bill Paul for help.)


32754 25-Jan-1998 dyson

Various NFS fixes:
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.

(Thanks for help from Bill Paul.)


32609 18-Jan-1998 tegge

Increase the minimum bootp reply packet size from 16 (bogus) to 300 (correct).


32358 09-Jan-1998 eivind

Make the BOOTP family new-style options (in opt_bootp.h)


32350 08-Jan-1998 eivind

Make INET a proper option.

This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(

This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.


32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


32071 29-Dec-1997 dyson

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


32011 27-Dec-1997 bde

Unspammed nested include of <vm/vm_zone.h>.


31886 20-Dec-1997 bde

Added a used include.

Fixed a gratuitous ANSIism and nearby KNF violations.


31617 08-Dec-1997 dyson

Various of the ISP users have commented that the 1.41 version of the
nfs_bio.c code worked better than the 1.44. This commit reverts
the important parts of 1.44 to 1.41, and we will fix it when we
can get a handle on the problem.


31391 24-Nov-1997 bde

Don't call malloc(..., M_WAITOK) at splnet(). Doing so is often
a mistake (since softnet interrupts may occur if malloc() waits),
and doing it harmlessly but unnecessarily here interfered with
detection of the mistaken cases.


31132 12-Nov-1997 julian

Reviewed by: various.

Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.


31017 07-Nov-1997 phk

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


30813 28-Oct-1997 bde

Removed unused #includes.


30808 28-Oct-1997 bde

Don't #include <nfs/nfs.h> in <nfs/nfs_node.h> if KERNEL is defined.
Fixed everything that depended on the nested include.


30780 27-Oct-1997 bde

Removed unused #includes. The need for most of them went away with
recent changes (docluster* and vfs improvements).


30743 26-Oct-1997 phk

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


30738 26-Oct-1997 phk

Always initialize the syscall vectors for our "private" syscalls (not
just in the LKM case).
Plug nqnfs_vop_lease_check directly into the default_vnodeop_p table.


30496 16-Oct-1997 phk

VFS clean up "hekto commit"

1. Add defaults for more VOPs
VOP_LOCK vop_nolock
VOP_ISLOCKED vop_noislocked
VOP_UNLOCK vop_nounlock
and remove direct reference in filesystems.

2. Rename the nfsv2 vnop tables to improve sorting order.


30492 16-Oct-1997 phk

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


30474 16-Oct-1997 phk

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


30439 15-Oct-1997 phk

vnops megacommit

1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.


30434 15-Oct-1997 phk

Hmm, realign the vnops into two columns.


30431 15-Oct-1997 phk

Stylistic overhaul of vnops tables.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.


30430 15-Oct-1997 phk

When the default vnops funtion is vn_default_error(), there is no reason to
implement small functions that just return EOPNOTSUPP for things we don't do.

The removed functions only apply to UFS based filesystems anyway.


30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


30118 05-Oct-1997 phk

Reverse rev 1.56 and rev 1.59. These made NFS too flakey.


29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


29363 14-Sep-1997 peter

select -> poll
flag missing vnode op table entries


29293 10-Sep-1997 phk

Don't repeat checks done at general level.


29291 10-Sep-1997 phk

Remove a couple of stubborn NetBSD #if's.


29288 10-Sep-1997 phk

unifdef -U__NetBSD__ -D__FreeBSD__


29202 07-Sep-1997 bde

Removed more vestiges of config-time swap configuration.


29024 02-Sep-1997 bde

Added used #include - don't depend on <sys/mbuf.h> including
<sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).


28787 26-Aug-1997 phk

Uncut&paste cache_lookup().

This unifies several times in theory indentical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.


28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


27845 02-Aug-1997 bde

Removed unused #includes.


27609 22-Jul-1997 dfr

Correct some dumb mistakes in the WebNFS stuff.

Submitted by: bde


27446 16-Jul-1997 dfr

Merge WebNFS changes from NetBSD.

Obtained from: NetBSD


26995 27-Jun-1997 wpaul

Fix a condition where nfs_statfs() can precipitate a panic. There is
code that says this:

nfsm_request(vp, NFSPROC_FSSTAT, p, cred);
if (v3)
nfsm_postop_attr(vp, retattr);
if (!error)
nfsm_dissect(sfp, struct nfs_statfs *, NFSX_STATFS(v3));

The problem here is that if error != 0, nfsm_dissect() will not be
called, which leaves sfp == NULL. But nfs_statfs() does not bail out
at this point: it continues processing until it tries to dereference
sfp, which causes a panic. I was able to generate this crash under
the following conditions:

1) Set up a machine as an NFS server and NFS client, with amd running
(using NIS maps). /usr/local is exported, though any exported fs
can can be used to trigger the bug.
2) Log in as normal user, with home directory mounted from a SunOS 4.1.3
NFS server via amd (along with a few other NFS filesystems from same
machine).
3) Su to root and type the following:
# mount localhost:/usr/local /mnt
# df

To fix the panic, I changed the code to read:

if (!error) {
nfsm_dissect(sfp, struct nfs_statfs *, NFSX_STATFS(v3));
} else
goto nfsmout;

This is a bit kludgy in that nfsmout is a label defined by the nfsm_subs.h
macros, but these macros are themselves more than a little kludgy. This
stops the machine from crashing, but does not fix the overall bug: 'error'
somehow becomes 5 (EIO) when a statfs() is performed on the locally mounted
NFS filesystem. This seems to only happen the first time the filesystem
is accesed: on subsequent accesses, it seems to work fine again.

Now, I know there's no practical use in mounting a local filesystem
via NFS, but doing it shouldn't cause the system to melt down.


26952 25-Jun-1997 tegge

Clear nfs_iodwant[myiod] when the nfsiod process exits due to a signal.


26929 25-Jun-1997 dfr

Avoid small synchronous writes when an application does lots of random-access
short writes within a block (e.g. ld).


26928 25-Jun-1997 dfr

Make nfs_lookup return a NULLVP on error so that DIAGNOSTIC kernels don't
panic.


26669 16-Jun-1997 dyson

Upgrade NFS to support the new vfs_bio resource/buffer management.


26581 12-Jun-1997 tegge

Move commonly used code into static functions in order to reduce kernel bloat.


26580 12-Jun-1997 tegge

Remove unused routines.


26469 06-Jun-1997 dfr

Fix a problem caused by removing large numbers of files from a directory
which could cause a bad size to be given to uiomove, causing a page fault.


26420 03-Jun-1997 dfr

Various fixes from NetBSD:

Use u_int for rpc procedure numbers.
Some fixes to NQNFS.
A rare NULL pointer dereference.
Ignore NFSMNT_NOCONN for TCP mounts.

Obtained from: NetBSD


26418 03-Jun-1997 dfr

Implement the async mount option for NFSv3. This makes NFS pretend that all
writes sent to the server were synchronous and therefore no commits are
needed. This is the same as the vfs.nfs.async variable on the server but
allows each client to choose whether to work this way.

Also make the vfs.nfs.async variable do the 'right' thing for NFSv3, i.e.
pretend that the write was synchronous.


26410 03-Jun-1997 dfr

Fix a problem with nfs_flush where if many B_NEEDCOMMIT buffers are
attached to the vnode, some of them could be re-written synchronously
(if they overflowed the fixed size array nfs_flush had for them). The
fix involves mallocing an array if there are more than its limited
size stack buffer.

Reviewed by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>


26409 03-Jun-1997 dfr

Fix some performance problems with the NFS mmap fixes.


25952 20-May-1997 dfr

Plug a memory leak in nfs_link.

PR: kern/1001


25930 19-May-1997 dfr

Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson


25877 17-May-1997 phk

Remove redundant check for vp == dvp (done in VFS before calling).


25804 14-May-1997 tegge

Use same syntax as netboot for root and swap mounts.
Handle mount options.
Ignore T16 (swap server address) and T6 (DNS server).


25785 13-May-1997 dfr

Check the B_CLUSTER flag when choosing whether to use unstable or filesync
writes.

PR: kern/3438
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


25781 13-May-1997 dfr

Don't keep addresses in mbuf chains. This should simplify the next round
of network changes from Garret.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


25755 12-May-1997 tegge

Use the old nfs arguments in the nfs_diskless structure, to be
compatible with boot proms made from the 2.2 source.
Convert the nfs arguments when copying to the new diskless structure.
Copy the gateway field in the diskless structure.


25723 11-May-1997 tegge

Bring in some kernel bootp support. This removes the need for netboot
to fill in the nfs_diskless structure, at the cost of some kernel
bloat. The advantage is that this code works on a wider range of
network adapters than netboot. Several new kernel options are
documented in LINT.
Obtained from: parts of the code comes from NetBSD.


25664 10-May-1997 dfr

Implement a separate control for write gathering on NFSv3. This is turned
off for NFSv3 by default since write gathering seems to reduce performance
for NFSv3 by up to 60%.

Add sysctl knobs to control both variables.


25663 10-May-1997 dfr

Fix a nasty hang connected with write gathering. Also add debug print
statements to bits of the server which helped me find the hang.


25611 09-May-1997 dfr

Prevent a mapped root which appears on the server as e.g. nobody from
accessing files which it shouldn't be able to. This required a better
approximation of VOP_ACCESS for NFSv2 (NFSv3 already has an ACCESS rpc
which is a better solution) and adding a call to VOP_ACCESS from VOP_LOOKUP.

PR: kern/876, kern/2635
Submitted by: David Malone <dwmalone@maths.tcd.ie> (for kern/2635)


25610 09-May-1997 dfr

Fix memory leak caused by the fact that the directory offset cookies and
the sillyrename information are stored in the same place.


25459 04-May-1997 phk

Now I can even execute "df" on my diskless :-)


25453 04-May-1997 phk

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.

Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


25416 03-May-1997 phk

Make nfs roots (diskless) functional again. It may still not be correct,
but it is functional.


25307 30-Apr-1997 dfr

Allow NULL rpcs on non-privileged ports at all times to work around broken
clients.

PR: kern/3298
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


25089 22-Apr-1997 dfr

Fix broken usage of nm_readdirsize and increase the socket buffers for UDP
to prevent possible socket overflows.

2.2 candidate.

PR: kern/3304
Reviewed by: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>


25023 19-Apr-1997 dfr

Fix a bug where a program which appended many small records to a file could
wind up writing zeros instead of real data when the file is on an NFSv2
mounted directory.

While tracking this bug down, I noticed that nfs_asyncio was waking *all*
the iods when a block was written instead of just one per block. Fixing this
gives a 25% performance improvment for writes on v2 (less for v3).

Both are 2.2 candidates.

PR: kern/2774


25003 18-Apr-1997 dfr

Don't allow partial buffers to be cluster-comitted.
Zero the b_dirty{off,end} after cluster-comitting a group of buffers.

With these fixes, I was able to complete a 'make world' with remote src
and obj directories.


24626 04-Apr-1997 dfr

Fix various bugs in the locking protocol, allowing proper shared locks
to be used. This should fix the lock panics that people are seeing.


24577 03-Apr-1997 dfr

The code which recovered from a modified directory situation did not check
for eof when re-caching the directory. This could cause it to loop forever
if a directory was truncated.


24381 29-Mar-1997 bde

Removed #include of <ufs/ufs/dir.h>. Nfs no longer depends on any ufs
features, and the one thing that it depended on (DIRBLKSIZ) now has
conflicting spelling.


24378 29-Mar-1997 bde

Define our own version of DIRBLKSIZ instead of (ab)using ufs's value.
Use the same value of 512 (ufs actually uses DEV_BSIZE). There are
too many versions of DIRBLKSIZ, one for ufs, one for ext2fs, one for
nfs, one for ibcs2, one for linux, one for applications, ... I think
nfs's DIRBLKSIZ needs to be a divisor of the directory blocks sizes
of all supported file systems. There is also NFS_DIRBLKSIZ, which is
different from nfs's DIRBLKSIZ but is sometimes confused with it in
comments.

Removed a bogus #ifdef KERNEL that hid the tunable constants for nfs.
This came in undocumented with the Lite2 merge although it isn't in
Lite2. It required more-bogus #define KERNEL's in fstat and pstat
to make the constants visible.

Restored a spelling fix from rev.1.17.

Removed duplicate #defines of all the the NFS mount option flags.


24330 27-Mar-1997 guido

Add code that will reject nfs requests in teh kernel from nonprivileged
ports. This option will be automatically set/cleraed when mount is run
without/with the -n option.
Reviewed by: Doug Rabson


24204 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 2: include
<sys/sockio.h> instead of <sys/ioctl.h> in network files.


24101 22-Mar-1997 bde

Fixed some invalid (non-atomic) accesses to `time', mostly ones of the
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.


23570 09-Mar-1997 bde

YAMInTheWrongDirectionF22 (part of rev.1.28.2.3: set B_CLUSTEROK for
commits).


23218 28-Feb-1997 bde

Fixed a panic in nfs_writevp(). Lite2 provided a fix for a silly
missing-parentheses bug, but this exposed a misplaced vfs_busy_pages().
This bug cost a factor of 2.5-3 in nfsv3 write performance! It should
be fixed in 2.2.

Removed some debugging code that gets triggered often in normal
operation. There are still many backwards diagnostics (#define
DIAGNOSTIC gives no diagnostics).

Submitted by: vfs_busy_pages() fix by dfr


22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


22871 18-Feb-1997 bde

Changed `#ifdef COMPAT_PRELITE2' to `#ifndef NO_COMPAT_PRELITE2' so that
old nfs mount calls are supported by default.


22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


21124 31-Dec-1996 wpaul

Fix (properly, I hope) 'panic: sillyrename dir' crash that can happen
if you do:

% cd /nfsdir
% mkdir -p foo/foo
% mv foo/foo .

nfs_sillyrename() self-destructs if you try to sillyrename a directory,
however nfs_rename() can be coerced into doing just that by the above
sequence of commands. To avoid this, nfs_rename() now checks that
v_type of the 'destination' vnode != VDIR before attempting the
sillyrename. The server correctly handles this particular situation
by returning ENOTEMPTY on the rename() attempt.

I asked if this was the correct fix for this on -hackers but nobody
ever answered.

This is a 2.2 candidate.


20407 13-Dec-1996 wollman

Convert the interface address and IP interface address structures
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).


19449 06-Nov-1996 dfr

Improve the queuing algorithms used by NFS' asynchronous i/o. The
existing mechanism uses a global queue for some buffers and the
vp->b_dirtyblkhd queue for others. This turns sequential writes into
randomly ordered writes to the server, affecting both read and write
performance. The existing mechanism also copes badly with hung
servers, tending to block accesses to other servers when all the iods
are waiting for a hung server.

The new mechanism uses a queue for each mount point. All asynchronous
i/o goes through this queue which preserves the ordering of requests.
A simple mechanism ensures that the iods are shared out fairly between
active mount points. This removes the sysctl variable vfs.nfs.dwrite
since the new queueing mechanism removes the old delayed write code
completely.

This should go into the 2.2 branch.


19070 21-Oct-1996 dfr

If a large (>4096 bytes) directory was modified, the old directory
contents are discarded, including the cached seek cookies.
Unfortunately, if the directory was larger than NFS_DIRBLKSIZ, then
this confused nfs_readdirrpc(), making it appear as if the directory
was truncated.

Reviewed by: Karl Denninger <karl@Mcs.Net>


19058 20-Oct-1996 phk

Add four sysctl variables that joerg wanted.


18888 12-Oct-1996 bde

Staticized `nfs_dwrite'.


18866 11-Oct-1996 dfr

This fixes a problem with the nfs socket handling code which happens
if a single process is performing a large number of requests (in this
case writing a large file). The writing process could monopolise the
recieve lock and prevent any other processes from recieving their
replies.

It also adds a new sysctl variable 'vfs.nfs.dwrite' which controls the
behaviour which originally pointed out the problem. When a process
writes to a file over NFS, it usually arranges for another process
(the 'iod') to perform the request. If no iods are available, then it
turns the write into a 'delayed write' which is later picked up by the
next iod to do a write request for that file. This can cause that
particular iod to do a disproportionate number of requests from a
single process which can harm performance on some NFS servers. The
alternative is to perform the write synchronously in the context of
the original writing process if no iod is avaiable for asynchronous
writing.

The 'delayed write' behaviour is selected when vfs.nfs.dwrite=1 and
the non-delayed behaviour is selected when vfs.nfs.dwrite=0. The
default is vfs.nfs.dwrite=1; if many people tell me that performance
is better if vfs.nfs.dwrite=0 then I will change the default.

Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>


18397 19-Sep-1996 nate

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>


17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


17186 16-Jul-1996 dfr

Various fixes from frank@fwi.uva.nl (Frank van der Linden) via
rick@snowhite.cis.uoguelph.ca:

1. Clear B_NEEDCOMMIT in nfs_write to make sure that dirty data is
correctly send to the server. If a buffer was dirtied when it was in
the B_DELWRI+B_NEEDCOMMIT state, the state of the buffer was left
unchanged and when the buffer was later cleaned, just a commit rpc was
made to the server to complete the previous write. Clearing
B_NEEDCOMMIT ensures that another write is made to the server.

2. If a server returned a server (for whatever reason) returned an
answer to a write RPC that implied that fewer bytes than requested
were written, bad things would happen.

3. The setattr operation passed on the atime in stead of the mtime to
the server. The fix is trivial.

4. XIDs always started at 0, but this caused some servers (older DEC
OSF/1 3.0 so I've been told) who had very long-lasting XID caches to
get confused if, after a reboot of a BSD client, RPCs came in with a
XID that had in the past been used before from that client. Patch is
to use the current time in seconds as a starting point for XIDs. The
patch below is not perfect, because it requires the root fs to be
mounted first. This is because of the check BSD systems do, comparing
FS time to system time.

Reviewed by: Bruce Evans, Terry Lambert.
Obtained from: frank@fwi.uva.nl (Frank van der Linden) via rick@snowhite.cis.uoguelph.ca


17096 11-Jul-1996 wollman

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


16634 23-Jun-1996 bde

Don't truncate minor or major numbers in the nfsv3 client.


16365 14-Jun-1996 phk

Fix for NFS_NOSERVER

Poul mentioned that he thought this was some kind of timing problem, and
that started me thinking. After a little poking around, I found that
nfs_timer() was completely disabled when NFS_NOSERVER was #defined.
But after looking at nfs_timer(), it seemed like it was something
required by both the client and server code, and disabling it outright
just didn't seem to make any sense. Parts of it relate only to the
NFS server side code, so I disabled those, but I re-enabled the rest
of the function and made sure that it would be called from nfs_init()
(in nfs_subs.c).

With nfs_timer() re-enabled, everything seems to work again. The only
other changes I made were to #ifdef away some variable declarations
in the NFS_NOSERVER case so that gcc would stop complaining about
unused variables.

Reviewed by: phk
Submitted by: Bill Paul <wpaul@skynet.ctr.columbia.edu>


16312 12-Jun-1996 dg

Moved the fsnode MALLOC to before the call to getnewvnode() so that the
process won't possibly block before filling in the fsnode pointer (v_data)
which might be dereferenced during a sync since the vnode is put on the
mnt_vnodelist by getnewvnode.

Pointed out by Matt Day <mday@artisoft.com>


16192 08-Jun-1996 pst

Clear flags before using an inactive buffer. This is a kludge, but
matches the code in bread().

Reviewed by: bde


15543 02-May-1996 phk

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


15480 30-Apr-1996 bde

#include <sys/filedesc.h> explicitly instead of depending on it being
bogusly included by <sys/socketvar.h>.


15479 30-Apr-1996 bde

Fixed nfs sysctls. They missed out on the fs -> vfs name changes from
Lite2. This broke nfsstat.


14093 13-Feb-1996 wollman

Kill XNS.
While we're at it, fix socreate() to take a process argument. (This
was supposed to get committed days ago...)


13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


13625 25-Jan-1996 bde

Fixed spelling of s_namlen so that this compiles again.


13619 24-Jan-1996 phk

Use new printf features rather than local kludges.


13612 24-Jan-1996 mpp

Add a check to prevent a computation from underflowing and causing
a panic due to an attaempt to allocate a buffer for a terabyte or
so of data when an attempt is made to create sparse data (e.g.
a holey file) more than 1 block past the end of the file.

Note: some other areas of this code need to be looked at,
since they might cause problems when the file size exceeds 2GB,
due to storing results in ints when the computations are being
done with quad sized variables.

Reviewed by: bde


13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


13416 13-Jan-1996 phk

Add an option NFS_NOSERVER which saves 100K in the install kernel (or
any other kernel that uses it). Use with option NFS.


13084 28-Dec-1995 phk

Don't print swap server as root server.
Submitted by: Mattias.Gronlund@sa.erisoft.se (Mattias Gronlund)


12970 22-Dec-1995 phk

Move fs.nfs.nfsstats sysctl var back to it's old OID.


12911 17-Dec-1995 phk

Staticize.


12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


12588 03-Dec-1995 bde

Completed function declarations and/or added prototypes and/or moved
prototypes to the right place.


12457 21-Nov-1995 bde

Completed function declarations, added prototypes and removed redundant
declarations.


12453 21-Nov-1995 bde

Completed function declarations and/or added prototypes.


12287 14-Nov-1995 phk

Get rid of hostnamelen variable.


12274 14-Nov-1995 bde

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


12158 09-Nov-1995 bde

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


12118 06-Nov-1995 bde

Replaced bogus macros for dummy devswitch entries by functions.
These functions went away:

enosys (hasn't been used for some time)
enxio
enodev
enoioctl (was used only once, actually for a vop)

if_tun.c:
Continued cleaning up...

conf.h:
Probably fixed the type of d_reset_t. It is hard to tell the correct
type because there are no non-dummy device reset functions.

Removed last vestige of ambiguous sleep message strings.


11982 31-Oct-1995 joerg

Include a prerequisite header (so this is consistent again with the
NFSv2 state).


11921 29-Oct-1995 phk

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


11645 22-Oct-1995 dg

Fix order problem: unbusy pages before releasing the buffer.

Submitted by: John Dyson <dyson>


11644 22-Oct-1995 dg

Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2


10551 04-Sep-1995 dyson

Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count
for VOP_BMAP. Updated affected filesystems...


10478 30-Aug-1995 dfr

Make nfs diskless work again.

Reviewed by: John Hay <jhay@mikom.csir.co.za>


10223 24-Aug-1995 dg

Killed redundant declarations of nfsm_rpchead().


10222 24-Aug-1995 dfr

Some fixes found using gcc -Wall:

nfsm_rpchead() has been called with the wrong number of args and misplaced
args since someone added new args in the middle for nfsv3.

Here's another one that would be important on 64-bit systems. VOP_READDIR
takes a `u_int **cookies' arg.

Submitted by: Bruce Evans <bde@zeta.org.au>


10219 24-Aug-1995 dfr

Add support for amd direct maps.

Reviewed by: Thomas Graichen <graichen@sirius.physik.fu-berlin.de>


10027 11-Aug-1995 dg

Converted mountlist to a CIRCLEQ.

Partially obtained from: 4.4BSD-Lite2


9842 01-Aug-1995 dg

Removed my special-case hack for VOP_LINK and fixed the problem with the
wrong vp's ops vector being used by changing the VOP_LINK's argument order.
The special-case hack doesn't go far enough and breaks the generic
bypass routine used in some non-leaf filesystems. Pointed out by Kirk
McKusick.


9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


9681 24-Jul-1995 dfr

Slightly better fix than previous revision.

Submitted by: Rick Macklem <rick@snowhite.cis.uoguelph.ca>


9679 24-Jul-1995 dfr

Fix a problem which appeared to truncate a file to the nearest block boundary
when it is moved to an NFS filesystem from from another filesystem and /bin/mv
failed to set the file ownership during the move.

I believe that this bug is present in STABLE but I have not tested it. The fix
would be the same in STABLE even though the code has changed quite considerably
in CURRENT.


9627 22-Jul-1995 dg

Correct my cut-'n-paste job from ffs_vfsops.c and fix up the formatting
to be similar.

Submitted by: Bruce Evans


9604 21-Jul-1995 dg

Implemented an nfs_node hash list lock, similar to what was implemented
in ffs_vget(), and for the same reason: to prevent a race condition that
results in duplicate vnodes/NFSnodes being allocated.


9588 20-Jul-1995 dg

vnode_pager_alloc() never returns NULL, so don't check for it.


9522 13-Jul-1995 dfr

I believe that the following fix to nfs_vnops.c should do the trick w.r.t.
the problem "when a file is truncated on the server after being written on
a client under NFSv3, the client doesn't see the size drop to zero".
(As you noted, the problem is that NMODIFIED wasn't being cleared by nfs_close
when it flushed the buffers. After checking through the code, the only place
where NMODIFIED was used to test for the possibility of dirty blocks was in
nfs_setattr(). The two cases are safe to do when there aren't dirty blocks,
so I just took out the tests. Unfortunately, testing for
v_dirtyblkhd.lh_first being non-null is not sufficient, since there are
times when the code moves blocks to the clean list and then back to the
dirty list.)

Submitted by: rick@snowhite.cis.uoguelph.ca


9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


9456 09-Jul-1995 dg

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


9428 07-Jul-1995 dfr

Use a consistent blocksize for sizing bufs to avoid panicing the bio system.


9365 28-Jun-1995 dfr

Use the correct cred for nfs_commit operations.


9356 28-Jun-1995 dg

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


9354 28-Jun-1995 dg

Fixed VOP_LINK argument order botch.


9336 27-Jun-1995 dfr

Changes to support version 3 of the NFS protocol.
The version 2 support has been tested (client+server) against FreeBSD-2.0,
IRIX 5.3 and FreeBSD-current (using a loopback mount). The version 2 support
is stable AFAIK.
The version 3 support has been tested with a loopback mount and minimally
against an IRIX 5.3 server. It needs more testing and may have problems.
I have patched amd to support the new variable length filehandles although
it will still only use version 2 of the protocol.

Before booting a kernel with these changes, nfs clients will need to at least
build and install /usr/sbin/mount_nfs. Servers will need to build and
install /usr/sbin/mountd.

NFS diskless support is untested.

Obtained from: Rick Macklem <rick@snowhite.cis.uoguelph.ca>


9221 14-Jun-1995 joerg

The duplicate information returned in fa_type and fa_mode
is an ambiguity in the NFS version 2 protocol.

VREG should be taken literally as a regular file. If a
server intents to return some type information differently
in the upper bits of the mode field (e.g. for sockets, or
FIFOs), NFSv2 mandates fa_type to be VNON. Anyway, we
leave the examination of the mode bits even in the VREG
case to avoid breakage for bogus servers, but we make sure
that there are actually type bits set in the upper part of
fa_mode (and failing that, trust the va_type field).

NFSv3 cleared the issue, and requires fa_mode to not
contain any type information (while also introduing sockets
and FIFOs for fa_type).

The fix has been tested against a variety of NFS servers.
It fixes problems with the ``Tropic'' NFS server for Windows,
while apparently not breaking anything.

Pointed-out by: scott@zorch.sf-bay.org (Scott Hazen Mueller)


9202 11-Jun-1995 rgrimes

Merge RELENG_2_0_5 into HEAD


8876 30-May-1995 rgrimes

Remove trailing whitespace.


8832 29-May-1995 dg

Fixed some serious bugs that resulted in object reference counts not being
handled correctly. This would manifest itself as "object deallocated too
many times" panics and perhaps other strange inconsistencies on NFS servers.

Reviewed by: me, of course
Submitted by: John Dyson


8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


8504 14-May-1995 dg

Changed swap partition handling/allocation so that it doesn't
require specific partitions be mentioned in the kernel config
file ("swap on foo" is now obsolete).

From Poul-Henning:

The visible effect is this:

As default, unless
options "NSWAPDEV=23"
is in your config, you will have four swap-devices.
You can swapon(2) any block device you feel like, it doesn't have
to be in the kernel config.

There is a performance/resource win available by getting the NSWAPDEV right
(but only if you have just one swap-device ??), but using that as default
would be too restrictive.

The invisible effect is that:

Swap-handling disappears from the $arch part of the kernel.
It gets a lot simpler (-145 lines) and cleaner.

Reviewed by: John Dyson, David Greenman
Submitted by: Poul-Henning Kamp, with minor changes by me.


7969 21-Apr-1995 dyson

Slight re-ordering of the creation of a vmio object to fix a condition
that can cause NFS I/O failures.


7871 16-Apr-1995 dg

Various fixes from John Dyson:

1) Rewrote screwy code that uses an incore buffer without making it busy.
2) Use B_CACHE instead of B_DONE in cases where it is appropriate.
3) Minor code optimization.

This *might* fix kern/345 submitted by Heikki Suonsivu.


7275 23-Mar-1995 dg

Deleted bogus DIAGNOSTIC "nfs_fsync: dirty" message. This can and does
happen normally when there is heavy write activity to a file since the
vnode isn't locked (NFS plays fast and loose with vnode locks). This change
"fixes" PR#267.


7095 16-Mar-1995 wollman

Add four more filesystem flags:

VFCF_NETWORK (this FS goes over the net)
VFCF_READONLY (read-write mounts do not make any sense)
VFCF_SYNTHETIC (data in this FS is not real)
VFCF_LOOPBACK (this FS aliases something else)

cd9660 is readonly; nullfs, umapfs, and union are loopback; NFS is netowkr;
procfs, kernfs, and fdesc are synthetic.


7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


6875 04-Mar-1995 dg

Removed obsolete vtrace() remnants.


6420 15-Feb-1995 phk

YF fix.


6417 15-Feb-1995 dg

Fixed two more bugs related to the merged cache changes.

Submitted by: John Dyson


6361 14-Feb-1995 phk

YFfix
+int nfsrv_vput __P(( struct vnode * ));
+int nfsrv_vrele __P(( struct vnode * ));
+int nfsrv_vmio __P(( struct vnode * ));


6210 06-Feb-1995 dg

Changed order of release of vnode/object to fix a problem where the vnode
is freed with an old object still attached (subsequently causing a panic).
Fixes NFS server panic "object/pager mismatch".

Submitted by: John Dyson


6151 03-Feb-1995 dg

Fixed bmap run-length brokeness.
Use bmap run-length extension when doing clustered paging.

Submitted by: John Dyson


6148 03-Feb-1995 dg

Removed a pile of vfs_unbusy_pages()...both unnecessary and wrong - resulted
in serious system instability. Changed a B_INVAL to a B_NOCACHE so that
buffer data is properly disposed of.

Submitted by: John Dyson, Rick Macklin, and ohki@gssm.otsuka.tsukuba.ac.jp


5471 10-Jan-1995 dg

Added two missing brelse() calls.

Submitted by: rick@snowhite.cis.uoguelph.ca


5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


5353 03-Jan-1995 jkh

From Gene Stark:
If nd->swap_nblks is zero in nfs_mountroot(), then the system
comes up without initializing swapdev_vp to an actual vnode pointer.
The swap pager assumes a non-NULL value for swapdev_vp.

The fix is to try initializing local swap if no NFS swap space
is specified.


5013 08-Dec-1994 phk

Would you please correct nfs/nfs_vfsops.c so that the ip address of the
root filesystem is printed out correctly?
It's line 299 in nfs/nfs_vfsops.c.

Reviewed by: phk
Submitted by: Luigi Rizzo luigi@iet.unipi.it


4067 02-Nov-1994 wollman

Forward-declare a few structures to avoid warning messages.


3820 23-Oct-1994 wollman

Implement fs.nfs MIB variables.


3797 22-Oct-1994 phk

This is where the action is. I'm still not sure that swap is 100% OK, but
it seems to work.


3664 17-Oct-1994 phk

This is a bunch of changes from NetBSD. There are a couple of bug-fixes.
But mostly it is changes to use the list-maintenance macros instead of
doing the pointer-gymnastics by hand.

Obtained from: NetBSD


3451 09-Oct-1994 dg

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


3396 06-Oct-1994 dg

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


3305 02-Oct-1994 phk

Prototyping and general gcc-shutting up. Gcc has one warning now which looks
bad, I will get to it eventually, unless somebody beats me to it.


2997 22-Sep-1994 wollman

Make NFS loadable.


2979 22-Sep-1994 wollman

More loadable VFS changes:

- Make a number of filesystems work again when they are statically compiled
(blush)

- FIFOs are no longer optional; ``options FIFO'' removed from distributed
config files.


2946 21-Sep-1994 wollman

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


2384 29-Aug-1994 dg

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


2175 21-Aug-1994 paul

More idempotency....... this is fun :-)


2152 20-Aug-1994 dg

Implemented filesystem clean bit via:

machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.

cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.

vfs_conf.c:
Removed unused rootfs global.

vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.

ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.

ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().

vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.


2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


2010 10-Aug-1994 dg

Initialize lockf pointer. I missed this when I made NFS use the generic
advlock mechanism, and not doing so results in random system crashes.


1979 09-Aug-1994 dg

Removed some padding bytes from the nfsnode struct to make the structure
size a power of 2 again. The system complains otherwise - probably because
it wastes space with our malloc scheme otherwise.


1960 08-Aug-1994 dg

Made lockf advisory locking code generic (rather than ufs specific), and
use it in NFS. This is required both for diskless support and for POSIX
compliance. Note: the support in NFS is only for the local node.

Submitted by: based on work originally done by Yuval Yurom


1937 08-Aug-1994 dg

Changed B_AGE policy to work correctly in a world with relatively large
buffer caches. The old policy generally ended up caching nothing.


1858 05-Aug-1994 dg

Converted 'vmunix' to 'kernel'.


1828 04-Aug-1994 dg

Made NFS attribute cache timeouts kernel config file tunable via
NFS_MINATTRTIMO and NFS_MAXATTRTIMO.


1817 02-Aug-1994 dg

Added $Id$


1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources