History log of /freebsd-11-stable/sys/compat/freebsd32/freebsd32_misc.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 366281 30-Sep-2020 kib

MFC r366085, r366113:
Do not leak oldvmspace if image activation failed


# 363919 05-Aug-2020 markj

MFC r363917:
Fix a TOCTOU vulnerability in freebsd32_copyin_control().

PR: 248257
Reported by: m00nbsd working with Trend Micro Zero Day Initiative
Reviewed by: kib
Security: SA-20:23.sendmsg
Security: CVE-2020-7460
Security: ZDI-CAN-11543


# 360225 23-Apr-2020 brooks

MFC r359938:

Remove bogus use of useracc() in (clock_)nanosleep.

There's no point in pre-checking that we can access the user's rmtp
pointer before we do it in copyout().

While here, improve style(9) compliance.

Reviewed by: imp
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24409


# 352125 10-Sep-2019 kib

MFC r351773:
Add procctl(PROC_STACKGAP_CTL).

PR: 239894


# 346020 07-Apr-2019 jah

MFC r345741:

freebsd32: fix padding of computed control message length for recvmsg()

Each control message region must be aligned on a 4-byte boundary on 32-bit
architectures. The 32-bit compat shim for recvmsg() gets the actual layout
right, but doesn't pad the payload length when computing msg_controllen for
the output message header. If a control message contains an unaligned
payload, such as the 1-byte TTL field in the example attached to PR 236737,
this can produce control message payload boundaries that extend beyond
the boundary reported by msg_controllen.

PR: 236737


# 339052 01-Oct-2018 asomers

MFC r336871, r336874

r336871:
getrusage(2): fix return value under 32-bit emulation

According to the man page, getrusage(2) should return EFAULT if the rusage
argument lies outside of the process's address space. But due to an
oversight in r100384, that's never been the case during 32-bit emulation.
Fix it.

PR: 230153
Reported by: tests(7)
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D16500

r336874:
freebsd32_getrusage(2): skip freebsd32_rusage_out on error

PR: 230153
Reported by: kib
X-MFC-With: 336871
Differential Revision: https://reviews.freebsd.org/D16500


# 338618 12-Sep-2018 markj

MFC r337423:
Improve handling of control message truncation.

PR: 131876


# 338617 12-Sep-2018 sobomax

MFC r312296 and r323254, which is new a socket option
SO_TS_CLOCK to pick from several different clock sources to
return timestamps when SO_TIMESTAMP is enabled and two
new nanosecond-precision timestamp types. This also fixes
recvmsg32() system call to properly down-convert layout of the
64-bit structures to match what 32-bit app(s) expect.

Bump __FreeBSD_version to indicate presence of a new
functionality.

Differential Revision: https://reviews.freebsd.org/D9171


# 337427 07-Aug-2018 kib

MFC r336980:
Provide compat32 shims for sched_rr_get_interval(2).

PR: 230175


# 337046 01-Aug-2018 jhb

MFC 332782:
Simplify the code to allocate stack for auxv, argv[], and environment vectors.

Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().


# 333162 02-May-2018 kib

MFC r332740:
Add PROC_PDEATHSIG_SET to procctl interface.

MFC r332825:
Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL.

MFC r333067:
Remove redundant pipe from pdeathsig.c test.


# 331922 03-Apr-2018 kib

MFC r331640:
Fix several leaks of kernel stack data through paddings.


# 328454 26-Jan-2018 jhb

MFC 326184: Decode kevent structures logged via ktrace(2) in kdump.

- Add a new KTR_STRUCT_ARRAY ktrace record type which dumps an array of
structures.

The structure name in the record payload is preceded by a size_t
containing the size of the individual structures. Use this to
replace the previous code that dumped the kevent arrays dumped for
kevent(). kdump is now able to decode the kevent structures rather
than dumping their contents via a hexdump.

One change from before is that the 'changes' and 'events' arrays are
not marked with separate 'read' and 'write' annotations in kdump
output. Instead, the first array is the 'changes' array, and the
second array (only present if kevent doesn't fail with an error) is
the 'events' array. For kevent(), empty arrays are denoted by an
entry with an array containing zero entries rather than no record.

- Move kevent decoding tables from truss to libsysdecode.

This adds three new functions to decode members of struct kevent:
sysdecode_kevent_filter, sysdecode_kevent_flags, and
sysdecode_kevent_fflags.

kdump uses these helper functions to pretty-print kevent fields.

- Move structure definitions for freebsd11 and freebsd32 kevent
structures to <sys/event.h> so that they can be shared with userland.
The 32-bit structures are only exposed if _WANT_KEVENT32 is defined.
The freebsd11 structures are only exposed if _WANT_FREEBSD11_KEVENT is
defined. The 32-bit freebsd11 structure requires both.

- Decode freebsd11 kevent structures in truss for the compat11.kevent()
system call.

- Log 32-bit kevent structures via ktrace for 32-bit compat kevent()
system calls.

- While here, constify the 'void *data' argument to ktrstruct().

Note that this version of the change for 11.x does not include freebsd11
kevent structures or _WANT_FREEBSD11_KEVENT. It also does not include
the change to decode the compat11.kevent system call in truss.


# 325866 15-Nov-2017 gordon

MFC r325865

Properly bzero kldstat structure to prevent kernel information leak.

Security: FreeBSD-SA-17:10.kldstat
Security: CVE-2017-1088


# 318244 12-May-2017 brooks

MFC r317845-r317846

r317845:
Provide a freebsd32 implementation of sigqueue()

The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets. The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.

Document the mixed ABI implementation of union sigval and the
incompability of sival_ptr with pointer integrity schemes.

Reviewed by: kib, wblock
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10605

r317846:
Regen post r317845.

MFC with: r317845
Sponsored by: DARPA, AFRL


# 317618 01-May-2017 vangyzen

MFC r315526

Add clock_nanosleep()

Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Regenerate syscall files.

Relnotes: yes
Sponsored by: Dell EMC


# 315657 21-Mar-2017 vangyzen

MFC r315510

nanosleep: plug a kernel memory disclosure

nanosleep() updates rmtp on EINVAL. In that case, kern_nanosleep()
has not updated rmt, so sys_nanosleep() updates the user-space rmtp
by copying garbage from its stack frame. This is not only a kernel
memory disclosure, it's also not POSIX-compliant. Fix it to update
rmtp only on EINTR.

Security: possibly
Sponsored by: Dell EMC


# 315555 19-Mar-2017 trasz

MFC r313281:

Add kern_cpuset_getaffinity() and kern_cpuset_getaffinity(),
and use it in compats instead of their sys_*() counterparts.

Sponsored by: DARPA, AFRL


# 315554 19-Mar-2017 trasz

MFC r313015:

Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sub_*() counterparts.

Sponsored by: DARPA, AFRL


# 315553 19-Mar-2017 trasz

MFC r313018:

Add kern_pread() and kern_pwrite(), and use it in compats instead
of their sys_*() counterparts. The svr4 is left unchanged.

Sponsored by: DARPA, AFRL


# 315551 19-Mar-2017 trasz

MFC r313016:

Replace calls to sys_truncate() with kern_truncate().

Sponsored by: DARPA, AFRL


# 315550 19-Mar-2017 trasz

MFC r312987:

Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Sponsored by: DARPA, AFRL


# 315548 19-Mar-2017 trasz

MFC r312986:

Replace sys_ftruncate() with kern_ftruncate() in various compats.

Sponsored by: DARPA, AFRL


# 315471 18-Mar-2017 kib

MFC r315238:
Use designated initializers for kevent_copyops.


# 314334 27-Feb-2017 kib

MFC kern_mmap(9) and related helpers.

MFC r302514 (by rwatson):
Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).

MFC r302524 (by rwatson):
When mmap(2) is used with a vnode, capture vnode attributes in the
audit trail.

MFC r313352 (by trasz):
Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise().

MFC r313655:
Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.

MFC r313696:
Rework r313352.


# 313793 16-Feb-2017 kib

MFC r313692:
Style: wrap long line.


# 313450 08-Feb-2017 jhb

MFC 310638:
Rename the 'flags' argument to getfsstat() to 'mode' and validate it.

This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.


# 311957 12-Jan-2017 kib

MFC r311452:
Do not allocate struct statfs on kernel stack.


# 311956 12-Jan-2017 kib

MFC r311447:
Some style fixes for getfstat(2)-related code.


# 306398 28-Sep-2016 kib

MFC r306081:
Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 302095 22-Jun-2016 brooks

Generate syscall tables and update pipe() implementation after r302094.

Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816


# 297400 29-Mar-2016 glebius

The sendfile(2) allows to send extra data from userspace before the file
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.

Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix


# 296060 25-Feb-2016 markj

Improve error handling for posix_fallocate(2) and posix_fadvise(2).

- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425


# 282708 10-May-2015 kib

On exec, single-threading must be enforced before arguments space is
allocated from exec_map. If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space. Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 282448 05-May-2015 peter

Fix an error in r281551, part of the getfsstat() / kern_getfsstat()
rework. The number of entries was supposed to be returned to the user,
not used as a scratch variable.

This broke RELENG_4 jails starting up on current systems.


# 281551 15-Apr-2015 trasz

Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision: https://reviews.freebsd.org/D2193
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 277610 23-Jan-2015 jilles

Add futimens and utimensat system calls.

The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes


# 277322 18-Jan-2015 kib

Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 277211 15-Jan-2015 kib

fcntl F_O{GET,SET}LK take pointer as the arg, handle them properly for
compat32.

Reported and tested by: Alex Tutubalin <lexa@lexa.ru>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 275800 15-Dec-2014 kib

Add a facility for non-init process to declare itself the reaper of
the orphaned descendants. Base of the API is modelled after the same
feature from the DragonFlyBSD.

Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 274476 13-Nov-2014 kib

Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 274462 13-Nov-2014 dchagin

Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision: https://reviews.freebsd.org/D1133
Reviewed by: kib, wblock
MFC after: 1 month


# 274408 11-Nov-2014 glebius

Fix build.


# 274403 11-Nov-2014 glebius

Remove SF_KQUEUE code. This code was developed at Netflix, but was not
ever used. It didn't go into stable/10, neither was documented.
It might be useful, but we collectively decided to remove it, rather
leave it abandoned and unmaintained. It is removed in one single
commit, so restoring it should be easy, if anyone wants to reopen
this idea.

Sponsored by: Netflix


# 273784 28-Oct-2014 kib

Replace some calls to fuword() by fueword() with proper error checking.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 3 weeks


# 273707 26-Oct-2014 mjg

Avoid dynamic syscall overhead for statically compiled modules.

The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by: kib (previous version)
MFC after: 2 weeks


# 272132 25-Sep-2014 kib

Fix fcntl(2) compat32 after r270691. The copyin and copyout of the
struct flock are done in the sys_fcntl(), which mean that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs neccessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 270691 26-Aug-2014 kib

Fix handling of the third argument for fcntl(2). The native syscall
uses long for arg, which needs translation.

Discussed with and tested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 268351 06-Jul-2014 marcel

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# 264164 05-Apr-2014 marcel

In freebsd32_sendmsg(), replace the call to sockargs() followed by a
call to freebsd32_convert_msg_in() with freebsd32_copyin_control() to
readin and convert in a single step. This makes it simpler to put all
the control messages in a single mbuf or mbuf cluster as per the
limitations imposed upon us by ip6_setpktopts().

The logic is as follows:
1. Go over the array of control messages to determine overall size
and include extra padding for proper alignment as we go.
2. Get a mbuf or mbuf cluster as needed or fail if the overall
(adjusted) size is larger than a cluster.
3. Go over the array of control messages again, but now copy them
into kernel space and into aligned offsets.
4. Update the length of the control message to take padding between
the header and the data into account (but not for padding added
between one control message and the next).

Obtained from: Juniper Networks, Inc.
MFC after: 1 week


# 263954 30-Mar-2014 imp

Remove instances of variables that were set, but never used. gcc 4.9
warns about these by default.


# 263349 19-Mar-2014 kib

Make the array pointed to by AT_PAGESIZES auxv properly aligned.

Also, remove the expression which calculated the location of the
strings for a new image and grown over the time to be
non-comprehensible. Instead, calculate the offsets by steps, which
also makes fixing the alignments much cleaner.

Reported and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 263233 16-Mar-2014 rwatson

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 263214 15-Mar-2014 jmg

change td_retval into a union w/ off_t, with defines to mask the
change... This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)... On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...

This will also prevent future breakage due to people adding additional
fields to the struct...

This gets AVILA booting a bit farther...

Reviewed by: bde


# 261290 30-Jan-2014 kib

The posix_madvise(3) and posix_fadvise(2) should return error on
failure, same as posix_fallocate(2).

Noted by: Bob Bishop <rb@gid.co.uk>
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 261080 23-Jan-2014 kib

The posix_fallocate(2) syscall should return error number on error,
without modifying errno.

Reported and tested by: Gennady Proskurin <gpr@mail.ru>
Reviewed by: mdf
PR: standards/186028
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 260806 17-Jan-2014 adrian

Implement a kqueue notification path for sendfile.

This fires off a kqueue note (of type sendfile) to the configured kqfd
when the sendfile transaction has completed and the relevant memory
backing the transaction is no longer in use by this transaction.
This is analogous to SF_SYNC waiting for the mbufs to complete -
except now you don't have to wait.

Both SF_SYNC and SF_KQUEUE should work together, even if it
doesn't necessarily make any practical sense.

This is designed for use by applications which use backing cache/store
files (eg Varnish) or POSIX shared memory (not sure anything is using
it yet!) to know when a region of memory is free for re-use. Note
it doesn't mark the region as free overall - only free from this
transaction. The application developer still needs to track which
ranges are in the process of being recycled and wait until all
pending transactions are completed.

TODO:

* documentation, as always

Sponsored by: Netflix, Inc.


# 260461 08-Jan-2014 adrian

Refactor out the common sendfile code from the do_sendfile() and the
compat32 sendfile syscall.

Sponsored by: Netflix, Inc.


# 258788 01-Dec-2013 adrian

Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by: Netflix, Inc.


# 258718 28-Nov-2013 peter

jail_v0.ip_number was always in host byte order. This was handled
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips(). When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.

This only matters for ancient binaries that call jail(2) themselves
internally.


# 258621 26-Nov-2013 adrian

Fix the compat32 sendfile() to be in line with my recent changes.

Reminded by: kib


# 255708 19-Sep-2013 jhb

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 255426 09-Sep-2013 jhb

Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by: alc
Approved by: re (kib)


# 255219 04-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 254490 18-Aug-2013 pjd

Move the PAIR32TO64() macro and the RETVAL_HI/RETVAL_LO defines to a
header file for use by other .c files.

Sponsored by: The FreeBSD Foundation


# 254356 15-Aug-2013 glebius

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 253530 21-Jul-2013 kib

Implement compat32 wrappers for the ktimer_* syscalls.

Reported, reviewed and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 253527 21-Jul-2013 kib

Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 253525 21-Jul-2013 kib

Cosmetic change, use the same union name on the left and right sides
of the conversion.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 253494 20-Jul-2013 kib

id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).

Reported and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
PR: threads/180652
Sponsored by: The FreeBSD Foundation


# 251198 31-May-2013 obrien

Add a "kern.features" MIB for 32bit support under a 64bit kernel.


# 250853 21-May-2013 kib

Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master. Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by: jhb
Pointy hat to: kib
MFC after: 1 week


# 243133 16-Nov-2012 kib

Style fixes for r242958.

Reported and reviewed by: bde
MFC after: 28 days


# 242958 13-Nov-2012 kib

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


# 235886 24-May-2012 gleb

Add kern_fhstat(), adjust sys_fhstat() to use it.

Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by: Google Summer of Code 2011


# 232475 03-Mar-2012 jmallett

On MIPS, _ALIGN always aligns to 8 bytes, even for 32-bit binaries. This might
not be ideal, but is the ABI we've shipped so far. Fix macros which reflect
the results of _ALIGN on 32-bit MIPS to use the right alignment.

This fixes sendmsg under COMPAT_FREEBSD32 on n64 MIPS kernels.


# 232449 03-Mar-2012 jmallett

o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI. This mostly follows nwhitehorn's lead in implementing
COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
32-bit ABI being used. Since the MIPS port is relatively-new, even the 32-bit
ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
with 32-bit compatibility, then, disable some code which assumes otherwise
wrongly when built for MIPS. A more general macro to check in this case would
seem like a good idea eventually. If someone adds support for using n32
userland with n64 kernels on MIPS, then they will have to add a variety of
flags related to each piece of the ABI that can vary. That's probably the
right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
freebsd32 compat code. Probably this should be generalized at some point.

Reviewed by: gonzo


# 230249 16-Jan-2012 mckusick

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


# 227502 14-Nov-2011 jhb

- Split out a kern_posix_fadvise() from the posix_fadvise() system call so
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 226388 15-Oct-2011 kib

Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by: marcel


# 226364 14-Oct-2011 jhb

Use PAIR32TO64() for the offset and length parameters to
freebsd32_posix_fallocate() to properly handle big-endian platforms.

Reviewed by: mdf
MFC after: 1 week


# 226353 13-Oct-2011 marcel

Use PTRIN().


# 226349 13-Oct-2011 marcel

Wrap mprotect(2) so that we can add execute permissions when read
permissions are requested. This is needed on amd64 and ia64 for
JDK 1.4.x


# 226347 13-Oct-2011 marcel

In freebsd32_mmap() and when compiling for amd64 or ia64, also
ask for execute permissions when read permissions are wanted.
This is needed for JDK 1.4.x on i386.


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 223166 16-Jun-2011 kib

Implement compat32 for old lseek, for the a.out binaries on amd64.


# 220791 18-Apr-2011 mdf

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# 220238 01-Apr-2011 kib

Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.

In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
on amd64.

The changes are hidden under COMPAT_43.

MFC after: 1 month


# 220158 30-Mar-2011 kib

Provide compat32 shims for kldstat(2).

Requested and tested by: jpaetzel
MFC after: 1 week


# 217151 08-Jan-2011 kib

Create shared (readonly) page. Each ABI may specify the use of page by
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.

Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.

Reviewed by: alc


# 215747 23-Nov-2010 pluknet

Update MNT_ROOTFS comments after changes in the root mount logic.

Reported by: arundel
Suggested by: marcel (phrasing)
Approved by: kib (mentor)


# 211412 17-Aug-2010 kib

Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by: marius (sparc64)
MFC after: 1 month


# 211006 07-Aug-2010 kib

Prefer struct sysentvec sv_psstrings to hardcoding FREEBSD32_PS_STRINGS
in the compat32 code. Use sv_usrstack instead of FREEBSD32_USRSTACK as well.

MFC after: 1 week


# 210848 04-Aug-2010 kib

Copy inode birthtime to the struct stat32.

MFC after: 1 week


# 210847 04-Aug-2010 kib

Fix style.

MFC after: 1 week


# 210796 03-Aug-2010 kib

When compat32 recvmsg(2) does not need to copy out control messages, set
msg_controllen to 0.

PR: kern/149227
Submitted by: Stef Walter <stef memberwebs com>
MFC after: 1 weeks


# 210545 27-Jul-2010 alc

Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name. The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by: kib


# 210475 25-Jul-2010 alc

Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer. For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%. It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings. Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings. While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map. In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated. The old size was one page too
large. (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by: kib


# 210431 23-Jul-2010 kib

Remove the linux_exec_copyin_args(), freebsd32_exec_copyin_args() may
server as well. COMPAT_FREEBSD32 is a prerequisite for COMPAT_LINUX32.

Reviewed by: alc
MFC after: 3 weeks


# 210429 23-Jul-2010 alc

Eliminate a little bit of duplicated code.


# 209687 04-Jul-2010 kib

Constify source argument for siginfo_to_siginfo32().

MFC after: 1 week


# 207007 21-Apr-2010 kib

Extract the code to copy-out struct rusage32 from struct rusage
into the new function.

Reviewed by: jhb
MFC after: 1 week


# 205792 28-Mar-2010 ed

Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.

A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.


# 205327 19-Mar-2010 kib

Remove empty line.

MFC after: 2 weeks


# 205323 19-Mar-2010 kib

Move SysV IPC freebsd32 compat shims from freebsd32_misc.c to corresponding
sysv_{msg,sem,shm}.c files.

Mark SysV IPC freebsd32 syscalls as NOSTD and add required
SYSCALL_INIT_HELPER/SYSCALL32_INIT_HELPERs to provide auto
register/unregister on module load.

This makes COMPAT_FREEBSD32 functional with SysV IPC compiled and loaded
as modules.

Reviewed by: jhb
MFC after: 2 weeks


# 205322 19-Mar-2010 kib

Move SysV IPC freebsd32 compat shims helpers from freebsd32_misc.c to
sysv_ipc.c.

Reviewed by: jhb
MFC after: 2 weeks


# 205321 19-Mar-2010 kib

Introduce SYSCALL_INIT_HELPER and SYSCALL32_INIT_HELPER macros and
neccessary support functions to allow registering dynamically loaded
syscalls from the MOD_LOAD handlers. Helpers handle registration
failures semi-automatically.

Reviewed by: jhb
MFC after: 2 weeks


# 205319 19-Mar-2010 kib

Make freebsd32_copyiniov() available outside of freebsd32_misc.

MFC after: 2 weeks


# 205014 11-Mar-2010 nwhitehorn

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 198508 27-Oct-2009 kib

Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 198507 27-Oct-2009 kib

In r197963, a race with thread being selected for signal delivery
while in kernel mode, and later changing signal mask to block the
signal, was fixed for sigprocmask(2) and ptread_exit(3). The same race
exists for sigreturn(2), setcontext(2) and swapcontext(2) syscalls.

Use kern_sigprocmask() instead of direct manipulation of td_sigmask to
reschedule newly blocked signals, closing the race.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 198506 27-Oct-2009 kib

In kern_sigsuspend(), better manipulate thread signal mask using
kern_sigprocmask() to properly notify other possible candidate threads
for signal delivery.

Since sigsuspend() shall only return to usermode after a signal was
delivered, do cursig/postsig loop immediately after waiting for
signal, repeating the wait if wakeup was spurious due to race with
other thread fetching signal from the process queue before us. Add
thread_suspend_check() call to allow the thread to be stopped or killed
while in loop.

Modify last argument of kern_sigprocmask() from boolean to flags,
allowing the function to be called with locked proc. Convertion of the
callers that supplied 1 to the old argument will be done in the next
commit, and due to SIGPROCMASK_OLD value equial to 1, code is formally
correct in between.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 197049 09-Sep-2009 kib

kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by: Peter Jeremy <peterjeremy acm org>
Reviewed by: jhb
MFC after: 1 week


# 195911 27-Jul-2009 jhb

Fix the freebsd32 versions of semsys(), shmsys(), and msgsys() to use the
old ABI versions of the relevant control system call (e.g.
freebsd7_freebsd32_msgctl() instead of freebsd32_msgctl() for msgsys()).

Approved by: re (kib)


# 195104 27-Jun-2009 rwatson

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 194910 24-Jun-2009 jhb

Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]

PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 191673 29-Apr-2009 jamie

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# 190466 27-Mar-2009 jamie

Whitespace/spelling fixes in advance of upcoming functional changes.

Approved by: bz (mentor)


# 189290 02-Mar-2009 jamie

Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by: bz (mentor)


# 186564 29-Dec-2008 ed

Push down Giant inside sysctl. Also add some more assertions to the code.

In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.

I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.

Reviewed by: Jille Timmermans <jille quis cx>


# 185589 03-Dec-2008 jhb

When unloading a 32-bit system call module, restore the sysent vector in
the 32-bit system call table instead of the main system call table.


# 185435 29-Nov-2008 bz

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 184829 10-Nov-2008 peter

Sigh. Fix a pointer/int compile error.


# 184828 10-Nov-2008 peter

Fix a signal emulation bug introduced in r163018 (and present in 7.x).
This prevents 32 bit signal handlers from finding out what the faulting
address is. Both the secret 4th argument and siginfo->si_addr are zero.


# 184183 22-Oct-2008 jhb

Split the copyout of *base at the end of getdirentries() out leaving the
rest in kern_getdirentries(). Use kern_getdirentries() to implement
freebsd32_getdirentries(). This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.

Submitted by: ups
MFC after: 1 week


# 183365 25-Sep-2008 jhb

Add support for installing 32-bit system calls from kernel modules. This
includes syscall32_{de,}register() routines as well as a module handler
and wrapper macros similar to the support for native syscalls in
<sys/sysent.h>.

MFC after: 1 month


# 183188 19-Sep-2008 obrien

Add freebsd32 compat shim for nmount(2).
(and quiet some compiler warnings for vfs_donmount)


# 183044 15-Sep-2008 obrien

style(9)


# 180436 10-Jul-2008 brooks

style(9): put parentheses around return values.


# 180433 10-Jul-2008 brooks

id_t is a 64-bit integer and thus is passed as two arguments like off_t is.
As a result, those arguments must be recombined before calling the real
syscal implementation. This change fixes 32-bit compatibility for
cpuset_getid(), cpuset_setid(), cpuset_getaffinity(), and
cpuset_setaffinity().


# 177789 31-Mar-2008 kib

Add the freebsd32 compatibility shims for the *at() syscalls.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 174526 10-Dec-2007 jhb

Bah, remove last vestiges of some statfs conversion fixes that aren't quite
ready for CVS yet that snuck into 1.68.

Pointy hat to: jhb


# 174430 07-Dec-2007 scottl

Grrr, remove an unused variable missed in the last commit.


# 174424 07-Dec-2007 scottl

Don't expect a return value from statfs_scale_blocks().


# 174381 06-Dec-2007 jhb

Add freebsd32 compat wrappers for msgctl() and _semctl() using
kern_msgctl() and kern_semctl().

MFC after: 1 week


# 174380 06-Dec-2007 jhb

Move 32-bit SYSV IPC structure definitions into freebsd32_ipc.h.

MFC after: 1 week


# 174377 06-Dec-2007 jhb

Move several data structure definitions out of freebsd32_misc.c and into
freebsd32.h instead.

MFC after: 1 week


# 174268 04-Dec-2007 jkim

Remove redundant checks for msgsnd(3) and msgrcv(3).
COMPAT_IA32 (implicitly) requires SYSVSEM, SYSVSHM and SYSVMSG in kernel.

Pointed out by: jhb


# 172003 28-Aug-2007 jhb

Rework the routines to convert a 5.x+ statfs structure (with fixed-size
64-bit counters) to a 4.x statfs structure (with long-sized counters).
- For block counters, we scale up the block size sufficiently large so
that the resulting block counts fit into a the long-sized (long for the
ABI, so 32-bit in freebsd32) counters. In 4.x the NFS client's statfs
VOP did this already. This can lie about the block size to 4.x binaries,
but it presents a more accurate picture of the ratios of free and
available space.
- For non-block counters, fix the freebsd32 stats converter to cap the
values at INT32_MAX rather than losing the upper 32-bits to match the
behavior of the 4.x statfs conversion routine in vfs_syscalls.c

Approved by: re (kensmith)


# 171215 04-Jul-2007 peter

Add compat6 wrapper code for mmap/lseek/pread/pwrite/truncate/ftruncate.

Approved by: re (kensmith)


# 170870 17-Jun-2007 mjacob

Try a cheap way to get around gcc4.2 believing that user arguments
to system calls can change across intervening functions.


# 169901 23-May-2007 cognet

Remove duplicate includes.

Submitted by: Cyril Nguyen Huu <cyril ci0 org>


# 169181 01-May-2007 alc

Eliminate the use of Giant from ia64-specific code in freebsd32_mmap().


# 165405 20-Dec-2006 jkim

MFP4: (part of) 110058

Fix 32-bit msgsnd(3) and msgrcv(3) emulations for amd64.


# 163018 04-Oct-2006 davidxu

Move some declaration of 32-bit signal structures into file
freebsd32-signal.h, implement sigtimedwait and sigwaitinfo system calls.


# 162954 02-Oct-2006 phk

First part of a little cleanup in the calendar/timezone/RTC handling.

Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.


# 162551 22-Sep-2006 davidxu

Add compatible code to let 32bit libthr work on 64bit kernel.


# 161343 15-Aug-2006 jkim

Include sys/limits.h for INT_MAX. freebsd32_proto.h 1.58 does not include
sys/umtx.h any more and previously it was included from there.


# 160249 10-Jul-2006 jhb

- Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.


# 160246 10-Jul-2006 jhb

Unexpand PTRIN() in several places and fix one instance where 0 was being
used instead of NULL.


# 159412 08-Jun-2006 ps

Do not copy out the iovec in the 32bit recvmsg call since soreceive
calls uiomove directly.

Reviewed by: ups
MFC after: 1 week


# 157285 30-Mar-2006 ps

Properly support for FreeBSD 4 32bit System V shared memory.

Submitted by: peter
Obtained from: Yahoo!
MFC after: 3 weeks


# 156440 08-Mar-2006 ups

Fix exec_map resource leaks.

Tested by: kris@


# 156266 03-Mar-2006 ps

use strlcpy in cvtstatfs and copy_statfs instead of bcopy to ensure
the copied strings are properly terminated.

bzero the statfs32 struct in copy_statfs.


# 156114 28-Feb-2006 ps

Fix 32bit sendfile by implementing kern_sendfile so that it takes
the header and trailers as iovec arguments instead of copying them
in inside of sendfile.

Reviewed by: jhb
MFC after: 3 weeks


# 155402 06-Feb-2006 jhb

- Always call exec_free_args() in kern_execve() instead of doing it in all
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
gone to the bottom of kern_execve() to cut down on some code duplication.


# 154586 20-Jan-2006 ambrisko

Add 32bit version of lutimes so untar doesn't mess up sym-links on amd64.


# 153247 08-Dec-2005 ambrisko

Add 32bit version of futimes so untar doesn't result in bad dates
(Jan 1, 1970) when run on amd64.

Reviewed by: ps


# 152134 06-Nov-2005 ps

Copy out the number of iovecs in freebsd32_recvmsg, not the length
of a single iovec.


# 151909 31-Oct-2005 ps

Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by: ps, ups
Reviewed by: jhb


# 151720 26-Oct-2005 peter

There is no 'freebsd3_' prefix for COMPAT_43 syscalls. Those are all
bundled under MCOMPAT and have an 'o' prefix. Adjust as appropriate.
This re-enables compiling without COMPAT_43 again.


# 151582 23-Oct-2005 ps

Implement for FreeBSD 3 32 binaries:
sigaction, sigprocmask, sigpending, sigvec, sigblock, sigsetmask,
sigsuspend, sigstack


# 151359 15-Oct-2005 ps

Implement the 32bit versions of recvmsg, recvfrom, sendmsg

Partially obtained from: jhb


# 151357 15-Oct-2005 ps

Implement 32bit wrappers for clock_gettime, clock_settime, and
clock_getres.


# 151355 15-Oct-2005 ps

Correct the prototype for freebsd32_nanosleep and use the proper
size when copying struct timespec32 in and out.


# 150883 03-Oct-2005 jhb

Use the constants for the syscall names from syscall.h rather than
hardcoding the numbers for the SYSVIPC syscalls.


# 147964 13-Jul-2005 jhb

Wrap the ia64-specific freebsd32_mmap_partial() hack in Giant for now
since it calls into VFS and VM. This makes the freebsd32_mmap() routine
MP safe and the extra Giants here can be revisited later.

Glanced at by: marcel
MFC after: 3 days


# 147813 07-Jul-2005 jhb

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week


# 147654 29-Jun-2005 jhb

- Change the commented out freebsd32_xxx() example to use kern_xxx() along
with a single copyin() + translate and translate + copyout() rather than
using the stackgap.
- Remove implementation of the stackgap for freebsd32 since it is no longer
used for that compat ABI.

Approved by: re (scottl)


# 147588 24-Jun-2005 jhb

Correct the amount of data to allocate in these local copies of
exec_copyin_strings() to catch up to rev 1.266 of kern_exec.c. This fixes
panics on amd64 with compat binaries since exec_free_args() was freeing
more memory than these functions were allocating and the mismatch could
cause memory to be freed out from under other concurrent execs.

Approved by: re (scottl)


# 147302 11-Jun-2005 pjd

Do not allocate memory based on not-checked argument from userland.
It can be used to panic the kernel by giving too big value.
Fix it by moving allocation and size verification into kern_getfsstat().
This even simplifies kern_getfsstat() consumers, but destroys symmetry -
memory is allocated inside kern_getfsstat(), but has to be freed by the
caller.

Found by: FreeBSD Kernel Stress Test Suite: http://www.holm.cc/stress/
Reported by: Peter Holm <peter@holm.cc>


# 147178 09-Jun-2005 pjd

Avoid code duplication in serval places by introducing universal
kern_getfsstat() function.

Obtained from: jhb


# 146950 03-Jun-2005 ps

Wrap copyin/copyout for kevent so the 32bit wrapper does not have
to malloc nchanges * sizeof(struct kevent) AND/OR nevents *
sizeof(struct kevent) on every syscall.

Glanced at by: peter, jmg
Obtained from: Yahoo!
MFC after: 2 weeks


# 146583 24-May-2005 ps

Copyout to userland if kern_sigaction succeeds


# 144450 31-Mar-2005 jhb

- Use a custom version of copyinuio() to implement readv/writev using
kern_readv/writev.
- Use kern_settimeofday() and kern_adjtime() rather than stackgapping it.


# 142934 01-Mar-2005 ps

Use kern_kevent instead of the stackgap for 32bit syscall wrapping.

Submitted by: jhb
Tested on: amd64


# 142918 01-Mar-2005 ps

Ooops. I will compile test before committing. The stackgap version
of kevent32 will be going away shortly, so this is temporary until
I commit the non-stackgap version.


# 142059 18-Feb-2005 jhb

- Add a custom version of exec_copyin_args() to deal with the 32-bit
pointers in argv and envv in userland and use that together with
kern_execve() and exec_free_args() to implement freebsd32_execve()
without using the stackgap.
- Fix freebsd32_adjtime() to call adjtime() rather than utimes(). Still
uses stackgap for now.
- Use kern_setitimer(), kern_getitimer(), kern_select(), kern_utimes(),
kern_statfs(), kern_fstatfs(), kern_fhstatfs(), kern_stat(),
kern_fstat(), and kern_lstat().

Tested by: cokane (amd64)
Silence on: amd64, ia64


# 140481 19-Jan-2005 ps

- rename nanosleep1 to kern_nanosleep
- Add a 32bit syscall entry for nanosleep

Reviewed by: peter
Obtained from: Yahoo!


# 138129 27-Nov-2004 das

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 136404 11-Oct-2004 peter

Put on my peril sensitive sunglasses and add a flags field to the internal
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.

I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.

With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.

This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.


# 136152 05-Oct-2004 jhb

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# 130640 17-Jun-2004 phk

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 128597 24-Apr-2004 marcel

Fix build for non-COMPAT_FREEBSD4 configurations. Make the FreeBSD 4
statfs functions conditional upon the option.


# 128260 14-Apr-2004 peter

Catch up to the not-so-recent statfs(2) changes.


# 127140 17-Mar-2004 jhb

- Replace wait1() with a kern_wait() function that accepts the pid,
options, status pointer and rusage pointer as arguments. It is up to
the caller to copyout the status and rusage to userland if needed. This
lets us axe the 'compat' argument and hide all that functionality in
owait(), by the way. This also cleans up some locking in kern_wait()
since it no longer has to drop locks around copyout() since all the
copyout()'s are deferred.
- Convert owait(), wait4(), and the various ABI compat wait() syscalls to
use kern_wait() rather than wait1() or wait4(). This removes a bit
more stackgap usage.

Tested on: i386
Compiled on: i386, alpha, amd64


# 125171 28-Jan-2004 peter

Regen


# 123746 23-Dec-2003 peter

Rather than screw around with the (unsafe) stackgap, call vn_stat/fo_stat
directly for stat/fstat/lstat syscall emulation. It turns out not only
safer, but the code is smaller this way too.


# 123744 23-Dec-2003 peter

Eliminate stackgap usage for the (woefully incomplete) path translations
since it isn't needed here anymore.
Use standard open(2)/access(2) and chflags(2) syscalls now.


# 123425 11-Dec-2003 peter

Just implementing a 32 bit version of gettimeofday() was smaller than
the wrapper code. And it doesn't use the stackgap as a bonus.


# 122253 07-Nov-2003 peter

Dont write to the stackgap directly in execve().


# 121719 30-Oct-2003 peter

Add CTASSERT()'s to check that the sizes of our replicas of the 32 bit
structures come out the right size.

Fix the ones that broke. stat32 had some missing fields from the end
and statfs32 was broken due to the strange definition of MNAMELEN
(which is dependent on sizeof(long))

I'm not sure if this fixes any actual problems or not.


# 119336 22-Aug-2003 peter

Switch to using the emulator in the common compat area.
Still work-in-progress.


# 119333 22-Aug-2003 peter

Initial sweep to de-i386-ify this


# 118031 25-Jul-2003 obrien

Use __FBSDID().

Brought to you by: a boring talk at Ottawa Linux Symposium


# 114987 14-May-2003 peter

Add BASIC i386 binary support for the amd64 kernel. This is largely
stolen from the ia64/ia32 code (indeed there was a repocopy), but I've
redone the MD parts and added and fixed a few essential syscalls. It
is sufficient to run i386 binaries like /bin/ls, /usr/bin/id (dynamic)
and p4. The ia64 code has not implemented signal delivery, so I had
to do that.

Before you say it, yes, this does need to go in a common place. But
we're in a freeze at the moment and I didn't want to risk breaking ia64.
I will sort this out after the freeze so that the common code is in a
common place.

On the AMD64 side, this required adding segment selector context switch
support and some other support infrastructure. The %fs/%gs etc code
is hairy because loading %gs will clobber the kernel's current MSR_GSBASE
setting. The segment selectors are not used by the kernel, so they're only
changed at context switch time or when changing modes. This still needs
to be optimized.

Approved by: re (amd64/* blanket)


# 113859 22-Apr-2003 jhb

- Replace inline implementations of sigprocmask() with calls to
kern_sigprocmask() in the various binary compatibility emulators.
- Replace calls to sigsuspend(), sigaltstack(), sigaction(), and
sigprocmask() that used the stackgap with calls to the corresponding
kern_sig*() functions instead without using the stackgap.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 111002 16-Feb-2003 phk

Remove #include <sys/dkstat.h>


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 107849 13-Dec-2002 alfred

SCARGS removal take II.


# 107839 13-Dec-2002 alfred

Backout removal SCARGS, the code freeze is only "selectively" over.


# 107838 13-Dec-2002 alfred

Remove SCARGS.

Reviewed by: md5


# 104738 09-Oct-2002 peter

Try and deal with the #ifdef COMPAT_FREEBSD4 sendfile stuff. This would
have been a lot easier if do_sendfile() was usable externally.


# 100384 20-Jul-2002 peter

Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from: dfr (mostly, many tweaks from me).