History log of /freebsd-current/sys/kern/imgact_elf.c
Revision Date Author Comments
# 364d1b2f 04-Mar-2024 John Baldwin <jhb@FreeBSD.org>

imgact_elf: Add const to the checknote parameter to __elfN(parse_notes)

Reviewed by: imp, kib
Sponsored by: University of Cambridge, Google, Inc.
Differential Revision: https://reviews.freebsd.org/D44215


# 169641f7 04-Mar-2024 Alex Richardson <arichardson@FreeBSD.org>

imgact_elf: Add const to a few struct image_params pointers

This makes it more obvious which functions modify fields in this struct.

Reviewed by: imp, kib
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D44214


# 29d4f8bf 09-Feb-2024 Konstantin Belousov <kib@FreeBSD.org>

ELF note parser: provide more info on failure

Print reasons when parser declined to parse notes, due to mis-alignment,
invalid length, or too many notes (the later typically means that there
is a loop). Also increase the loop limit to 4096, which gives enough
iterations for notes to fill whole notes' page.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# a67edb56 09-Feb-2024 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf.c: remove sys/cdefs.h include

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# eb32c1c7 02-Nov-2023 Andrew Turner <andrew@FreeBSD.org>

sysent: Add sv_protect

To allow for architecture specific protections add sv_protect to struct
sysent. This can be used to apply these after the executable is loaded
into the new address space.

Reviewed by: kib
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42440


# a04633ce 01-Nov-2023 Andrew Turner <andrew@FreeBSD.org>

imgact_elf: Export __elfN(parse_notes)

This is useful to check if a note is present and contains an expected
value, e.g. to read NT_GNU_PROPERTY_TYPE_0 on arm64 to see if we should
enable BTI.

Reviewed by: kib, markj
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42439


# 9d2612fc 01-Nov-2023 Andrew Turner <andrew@FreeBSD.org>

imgact_elf: Move GNU_ABI_VENDOR to a common header

Move the definition of GNU_ABI_VENDOR to a common location so it can
be used in multiple files.

Reviewed by: emaste, kib, imp
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42442


# 326bf508 26-Oct-2023 Brooks Davis <brooks@FreeBSD.org>

auxv: make AT_BSDFLAGS unsigned

AT_BSDFLAGS shouldn't be sign extended on 64-bit systems so use a
uint32_t instead of an int.

Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D42365


# 1798b44f 24-Oct-2023 Konstantin Belousov <kib@FreeBSD.org>

user stack randomization: only enable by default for 64bit processes

All aslr knobs are disabled by default for 32bit processes, except
stack. This results in weird stack location, typically making around 1G
of user address space hard to use.

Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42356


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 659a0041 30-May-2023 Jessica Clarke <jrtc27@FreeBSD.org>

imgact: Make et_dyn_addr part of image_params

This already gets passed around between various imgact_elf functions, so
moving it removes an argument from all those places. A future commit
will make use of this for hwpmc, though, to provide the load base for
PIEs, which currently isn't available to tools like pmcstat.

Reviewed by: kib, markj, jhb
Differential Revision: https://reviews.freebsd.org/D39594


# 57578dea 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

Brandinfo: Retire emul_path as unneeded anymore

The Barndinfo emul_path was used by the Elf image activator to fixup
interpreter file name according to ABI root directory. Since the
non-native ABI can now specify its root directory directly to namei()
via pwd_altroot() call this facility is not needed anymore.

Differential Revision: https://reviews.freebsd.org/D40091
MFC after: 2 month


# ff41239f 12-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

Add AT_USRSTACK{BASE, LIM} AT vectors, and ELF_BSDF_VMNOOVERCOMMIT flag

Reviewed by: brooks, imp (previous version)
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36540


# fbafa98a 18-Mar-2022 Ed Maste <emaste@FreeBSD.org>

Disallow invalid PT_GNU_STACK

Stack must be at least readable and writable.

PR: 242570
Reviewed by: kib, markj
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35867


# 00d17cf3 03-Jun-2022 Konstantin Belousov <kib@FreeBSD.org>

elf_note_prpsinfo: handle more failures from proc_getargv()

Resulting sbuf_len() from proc_getargv() might return 0 if user mangled
ps_strings enough. Also, sbuf_len() API contract is to return -1 if the
buffer overflowed. The later should not occur because get_ps_strings()
checks for catenated length, but check for this subtle detail explicitly
as well to be more resilent.

The end result is that p_comm is used in this situations.

Approved by: so
Security: FreeBSD-SA-22:09.elf
Reported by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: delphij, markj
admbugs: 988
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35391


# f0687f3e 05-Aug-2022 Ed Maste <emaste@FreeBSD.org>

Clarify code comments on ASLR default settings

Sponsored by: The FreeBSD Foundation


# 939f0b63 10-May-2022 Kornel Dulęba <kd@FreeBSD.org>

Implement shared page address randomization

It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35349


# 361971fb 02-Jun-2022 Kornel Dulęba <kd@FreeBSD.org>

Rework how shared page related data is stored

Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393


# 0288d427 30-Jun-2022 John Baldwin <jhb@FreeBSD.org>

Add register sets for NT_THRMISC and NT_PTLWPINFO.

For the kernel this is mostly a non-functional change. However, this
will be useful for simplifying gcore(1).

Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D35666


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 1babcad6 22-Mar-2022 Mark Johnston <markj@FreeBSD.org>

elf: Avoid dumping uninitialized bytes in PRSTATUS core dump notes

elf_prstatus_t contains pad space.

Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34606


# 6b71405b 10-Mar-2022 John Baldwin <jhb@FreeBSD.org>

Store core dump notes for all valid register sets for FreeBSD processes.

In particular, use a generic wrapper around struct regset rather than
requiring per-regset helpers. This helper replaces the MI
__elfN(note_prstatus) and __elfN(note_fpregset) helpers. It also
removes the need to explicitly dump NT_ARM_ADDR_MASK in the arm64
__elfN(dump_thread).

Reviewed by: markj, emaste
Sponsored by: University of Cambridge, Google, Inc.
Differential Revision: https://reviews.freebsd.org/D34446


# 0b25cbc7 03-Mar-2022 John Baldwin <jhb@FreeBSD.org>

Fix the size returned for NT_FPREGSET.

Sponsored by: University of Cambridge, Google, Inc.


# 548a2ec4 24-Jan-2022 Andrew Turner <andrew@FreeBSD.org>

Add PT_GETREGSET

This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be
used to access all the registers from a specified core dump note type.
The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other
machine-dependant types are expected to be added in the future.

The ptrace addr points to a struct iovec pointing at memory to hold the
registers along with its length. On success the length in the iovec is
updated to tell userspace the actual length the kernel wrote or, if the
base address is NULL, the length the kernel would have written.

Because the data field is an int the arguments are backwards when
compared to the Linux PTRACE_GETREGSET call.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19831


# 1811c1e9 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Reimplement stack address randomization

The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack. This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings. In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.

There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose. Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.

The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.

PR: 260303
Reviewed by: kib
Discussed with: emaste, mw
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 758d98de 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Remove the stack gap implementation

ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.

No functional change intended, as the stack gap implementation is
currently disabled by default.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 706f4a81 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Introduce the PROC_PS_STRINGS() macro

Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# bfd45121 14-Dec-2021 Mark Johnston <markj@FreeBSD.org>

imgact_elf: Disable the stack gap for now

The integration with RLIMIT_STACK is still causing problems for some
programs such as lang/sdcc and syzkaller's executor. Until this is
resolved by some work currently in progress, disable the stack gap by
default.

PR: 260303
Reviewed by: kib, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33438


# e499988f 12-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

exec_elf: use intermediate u_long variable to correct mismatched type

vm_offset_t * vs. u_long *

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# bf839416 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf: avoid mapsz overflow

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# 36df8f54 09-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf: check that the alignment of PT_LOAD segment is power of two

and stop recalculating alignment for PIE base, which was off by one
power of two.

Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# 714d6d09 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf: exclude invalid alignment requests

Only accept at most superpage alignment, or if the arch does not have
superpages supported, artificially limit it to PAGE_SIZE * 1024.
This is somewhat arbitrary, and e.g. could change what binaries do
we accept between native i386 vs. amd64 ia32 with superpages disabled,
but I do not believe the difference there is affecting anybody with
real (useful) binaries.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# a4007ae1 09-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

rnd_elf: add comment explaining the interface

Requested and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# 9cf78c1c 07-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

elf image activator: convert asserts into errors

Invalid (artificial) layout of the loadable ELF segments might result in
triggering the assertion. This means that the file should not be
executed, regardless of the kernel debug mode. Change calling
conventions for rnd_elf{32,64} helpers to allow returning an error, and
abort activation with ENOEXEC if its invariants are broken.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# b4b20492 09-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

exec_elf: assert that the image vnode is still locked on return

Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# 88dd7a0a 08-Dec-2021 Konstantin Belousov <kib@FreeBSD.org>

Style

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33359


# eb029587 24-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

Add kern.elf{32,64}.vdso knobs to enable/disable vdso preloading

Reviewed by: emaste
Discussed with: jrtc27
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D32960


# 01c77a43 11-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

Pass vdso address to userspace

Reviewed by: emaste
Discussed with: jrtc27
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D32960


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# 4082b189 17-Nov-2021 Alex Richardson <arichardson@FreeBSD.org>

elf*_brand_inuse: Change return type to bool.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33052


# 19621645 17-Nov-2021 Alex Richardson <arichardson@FreeBSD.org>

imgact_elf: Use bool instead of boolean_t.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33051


# b014e0f1 24-Oct-2021 Marcin Wojtas <mw@FreeBSD.org>

Enable ASLR by default for 64-bit executables

Address Space Layout Randomization (ASLR) is an exploit mitigation
technique implemented in the majority of modern operating systems.
It involves randomly positioning the base address of an executable
and the position of libraries, heap, and stack, in a process's address
space. Although over the years ASLR proved to not guarantee full OS
security on its own, this mechanism can make exploitation more difficult.

Tests on the tier 1 64-bit architectures demonstrated that the ASLR is
stable and does not result in noticeable performance degradation,
therefore it should be safe to enable this mechanism by default.
Moreover its effectiveness is increased for PIE (Position Independent
Executable) binaries. Thanks to commit 9a227a2fd642 ("Enable PIE by
default on 64-bit architectures"), building from src is not necessary
to have PIE binaries. It is enough to control usage of ASLR in the
OS solely by setting the appropriate sysctls.

This patch toggles the kernel settings to use address map randomization
for PIE & non-PIE 64-bit binaries. It also disables SBRK, in order
to allow utilization of the bss grow region for mappings. The latter
has no effect if ASLR is disabled, so apply it to all architectures.

As for the drawbacks, a consequence of using the ASLR is more
significant VM fragmentation, hence the issues may be encountered
in the systems with a limited address space in high memory consumption
cases, such as buildworld. As a result, although the tests on 32-bit
architectures with ASLR enabled were mostly on par with what was
observed on 64-bit ones, the defaults for the former are not changed
at this time. Also, for the sake of safety keep the feature disabled
for 32-bit executables on 64-bit machines, too.

The committed change affects the overall OS operation, so the
following should be taken into consideration:
* Address space fragmentation.
* A changed ABI due to modified layout of address space.
* More complicated debugging due to:
* Non-reproducible address space layout between runs.
* Some debuggers automatically disable ASLR for spawned processes,
making target's environment different between debug and
non-debug runs.

In order to confirm/rule-out the dependency of any encountered issue
on ASLR it is strongly advised to re-run the test with the feature
disabled - it can be done by setting the following sysctls
in the /etc/sysctl.conf file:
kern.elf64.aslr.enable=0
kern.elf64.aslr.pie_enable=0

Co-developed by: Dawid Gorecki <dgr@semihalf.com>
Reviewed by: emaste, kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D27666


# 889b56c8 13-Oct-2021 Dawid Gorecki <dgr@semihalf.com>

setrlimit: Take stack gap into account.

Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.

Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.

PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516


# 796a8e1a 01-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

procctl(2): Add PROC_WXMAP_CTL/STATUS

It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.

Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779


# b7924341 27-Aug-2021 Andrew Turner <andrew@FreeBSD.org>

Create sys/reg.h for the common code previously in machine/reg.h

Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).

Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830


# ebf98866 23-Jul-2021 Mark Johnston <markj@FreeBSD.org>

imgact_elf: Avoid redefining suword()

Otherwise this interferes with the definition for sanitizer
interceptors.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 5d9f7901 29-Jun-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Eliminate p_elf_machine from struct proc.

Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now
cached in the struct proc at p_elf_brandinfo member.

Note to MFC: D30918, KBI

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30926
MFC after: 2 weeks


# 615f22b2 29-Jun-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Add a link to the Elf_Brandinfo into the struc proc.

To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.

Note to MFC: it breaks KBI.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30918
MFC after: 2 weeks


# 435754a5 29-Jun-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add infrastructure required for Linux coredump support

This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values. It also updates all
of the ABI definitions to preserve current behaviour.

This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication. It will be used
for Linux coredumps.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30921


# 61b4c627 22-Jun-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

imgact_elf.c: style, remove unnecessary casts

Remove unnecessary type casts and redundant brackets.
No functional changes.

Suggested By: kib
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30841


# 06250515 21-Jun-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

imgact_elf: compute auxv buffer size instead of using magic value

The new buffer is somewhat larger, but there should be no functional
changes.

Reviewed By: kib, imp
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30821


# 905d192d 26-May-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Unstaticize parts of coredumping code

This makes it possible to call __elfN(size_segments) and __elfN(puthdr)
from Linux coredump code.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30455


# 33621dfc 22-May-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor core dumping code a bit

This makes it possible to use core_write(), core_output(),
and sbuf_drain_core_output(), in Linux coredump code. Moving
them out of imgact_elf.c is necessary because of the weird way
it's being built.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30369


# 86ffb3d1 24-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

ELF coredump: define several useful flags for the coredump operations

- SVC_ALL request dumping all map entries, including those marked as
non-dumpable
- SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl
policy
- SVC_PC_COREDUMP is provided for future use by userspace core dump
request

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955


# 5bc3c617 24-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf: consistently pass flags from coredump down to helper functions

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29955


# 409ab7e1 26-Apr-2021 Mark Johnston <markj@FreeBSD.org>

imgact_elf: Ensure that the return value in parse_notes is initialized

parse_notes relies on the caller-supplied callback to initialize "res".
Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and
the latter fails to initialize res. Fix it.

In the worst case, the bug would cause the inner loop of check_note to
examine more program headers than necessary, and the note header usually
comes last anyway.

Reviewed by: kib
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29986


# 41032835 14-Feb-2021 Jason A. Harmening <jah@FreeBSD.org>

Fix divide-by-zero panic when ASLR is enabled and superpages disabled

When locating the anonymous memory region for a vm_map with ASLR
enabled, we try to keep the slid base address aligned on a superpage
boundary to minimize pagetable fragmentation and maximize the potential
usage of superpage mappings. We can't (portably) do this if superpages
have been disabled by loader tunable and pagesizes[1] is 0, and it
would be less beneficial in that case anyway.

PR: 253511
Reported by: johannes@jo-t.de
MFC after: 1 week
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28678


# 0659df6f 12-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_map_protect: allow to set prot and max_prot in one go.

This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome. E.g. you can get an error that is not
possible if operation is performed atomic.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by: brooks, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28117


# 2e1c94aa 08-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Implement enforcing write XOR execute mapping policy.

It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050


# 4daea938 31-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Lock proctree in around fill_kinfo_proc().

Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.

Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.

Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# 673e2dd6 18-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Add ELF flag to disable ASLR stack gap.

Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 85078b85 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

Split out cwd/root/jail, cmask state from filedesc table

No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037


# f8e8a06d 10-Oct-2020 Conrad Meyer <cem@FreeBSD.org>

random(4) FenestrasX: Push root seed version to arc4random(3)

Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no
functional change. The mechanism can be disabled with
debug.fxrng_vdso_enable=0.

arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time. Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed. On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.

This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator. That change is
left for future discussion/work.

Reviewed by: kib (previous version)
Approved by: csprng (me -- only touching FXRNG here)
Differential Revision: https://reviews.freebsd.org/D22839


# c88285c5 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

Fix the INVARIANTS build for 32-bit platforms

Reported by: Jenkins
MFC with: r366368


# f31695cc 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

Implement sparse core dumps

Currently we allocate and map zero-filled anonymous pages when dumping
core. This can result in lots of needless disk I/O and page
allocations. This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.

Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page. Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file. This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.

The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.

PR: 249067
Reviewed by: kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26590


# fec41f07 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

Simplify the check for non-dumpable VM object types

OBJT_DEFAULT, _SWAP, _VNODE and _PHYS is exactly the set of
non-fictitious object types, so just check for OBJ_FICTITIOUS. The
check no longer excludes dead objects, but such objects have to be
handled regardless.

No functional change intended.

Reviewed by: alc, dougm, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26589


# 7de1bc13 07-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf.c: unify check for phdr fitting into the first page.

Similar to the userspace rtld check.

Reviewed by: dim, emaste (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26339


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 0cad2aa2 23-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

Pass pointers to info parsed from notes, to brandinfo->header_supported filter.

Currently, we parse notes for the values of ELF FreeBSD feature flags
and osrel. Knowing these values, or knowing that image does not carry
the note if pointers are NULL, is useful to decide which ABI variant
(brand) we want to activate for the image.

Right now this is only a plumbing change

Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273


# b24e6ac8 16-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Convert canary, execpathp, and pagesizes to pointers.

Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.

Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446


# 9df1c38b 15-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Export argc, argv, envc, envv, and ps_strings in auxargs.

This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24407


# 59838c1a 01-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Retire procfs-based process debugging.

Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.

PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 944cf37b 08-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add AT_BSDFLAGS auxv entry.

The intent is to provide bsd-specific flags relevant to interpreter
and C runtime. I did not want to reuse AT_FLAGS which is common ELF
auxv entry.

Use bsdflags to report kernel support for sigfastblock(2). This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS. The tunable kern.elf{32,64}.sigfastblock blocks reporting.

Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 741dfd86 27-Dec-2019 Justin Hibbits <jhibbits@FreeBSD.org>

Fix the powerpc copyout fixup from r356113

Summary:
r356113 used an older patch, which predated the
freebsd_copyout_auxargs() addition. Fix this by using a private
powerpc_copyout_auxargs() instead, and keep it private to powerpc, not in MI
files.

Reviewed by: kib, bdragon
Differential Revision: https://reviews.freebsd.org/D22935


# 6b457273 26-Dec-2019 Justin Hibbits <jhibbits@FreeBSD.org>

Fix the build from r356113.

Types had changed from when the patch was first created, and a final build
was not done pre-commit.


# adea0d63 26-Dec-2019 Justin Hibbits <jhibbits@FreeBSD.org>

Eliminate the last MI difference in AT_* definitions (for powerpc).

Summary:
As a transition aide, implement an alternative elfN_freebsd_fixup which
is called for old powerpc binaries. Similarly, add a translation to rtld to
convert old values to new ones (as expected by a new rtld).

Translation of old<->new values is incomplete, but sufficient to allow an
installworld of a new userspace from an old one when a new kernel is running.

Test Plan:
Someone needs to see how a new kernel/rtld/libc works with an old
binary. If if works we can probalby ship this. If not we probalby need
some more compat bits.

Submitted by: brooks
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20799


# d8010b11 09-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Copy out aux args after the argument and environment vectors.

Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695


# 31174518 03-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Use uintptr_t instead of register_t * for the stack base.

- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.

Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501


# 03b0d68c 18-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Check for errors from copyout() and suword*() in sv_copyout_args/strings.

Reviewed by: brooks, kib
Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22401


# e3532331 15-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Add a sv_copyout_auxargs() hook in sysentvec.

Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355


# 2e5f9189 31-Oct-2019 Ed Maste <emaste@FreeBSD.org>

avoid kernel stack data leak in core dump thrmisc note

bzero the entire thrmisc struct, not just the padding. Other core dump
notes are already done this way.

Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882


# f33533da 21-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

kern.elf{32,64}.pie_base sysctl: enforce page alignment.

Requested by: rstone
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 95aafd69 21-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Make non-ASLR pie base tunable.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 1073d17e 07-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

When loading ELF interpreter, initialize whole nested image_params with zero.

Otherwise we could mishandle imgp->textset.

Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21560


# a1549acb 05-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix mis-merge.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f422bc30 02-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Set ISOPEN in namei flags when opening executable interpreters.

These vnodes are explicitly opened via VOP_OPEN via
exec_check_permissions identical to the main exectuable image.
Setting ISOPEN allows filesystems to perform suitable checks in
VOP_LOOKUP (e.g. close-to-open consistency in the NFS client).

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D21129


# fc83c5a7 31-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Make randomized stack gap between strings and pointers to argv/envs.

This effectively makes the stack base on the csu _start entry
randomized.

The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.

Only amd64 for now.

Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081


# e020a35f 24-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Remove a redundant offset computation in elf_load_section().

With r344705 the offset is always zero.

Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>


# 5c32e9fc 18-Jun-2019 Alexander Motin <mav@FreeBSD.org>

Optimize kern.geom.conf* sysctls.

On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.

This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# f1f81d3b 17-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Grammar fixes for r347690.

Submitted by: alc
MFC after: 3 days


# 0ddfdc60 16-May-2019 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf.c: Add comment explaining the malloc/VOP_UNLOCK() dance
from r347148.

Requested by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 2d6b8546 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

imgact_elf: do not relock the text vnode if possible.

We unlock the vnode around malloc(M_WAITOK), to make it possible for
pagedaemon to flush vnode pages for us. Instead of doing it
unconditionally, first try M_NOWAIT allocation, which typically
succeed. Only on failure, unlock the vnode and retry with M_WAITOK.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19923


# 4033ecc9 11-Apr-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Use shared vnode locks for the ELF interpreter.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19874


# b65ca345 10-Apr-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Improve vnode lock assertions.

MFC after: 2 weeks
Sponsored by: DARPA, AFRL


# 9bcd7482 09-Apr-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Factor out section loading into a separate function.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19846


# 9274fb35 08-Apr-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor ELF interpreter loading into a separate function.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19741


# be7808dc 30-Mar-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix branding after r345661.

In particular, elf32 FreeBSD binaries were not executed on LP64 hosts.
The interp_name_len value should account for the nul terminator. This
is needed for strncmp()s in brand checking code to work.

Reported by: andreast
Sponsored by: The FreeBSD Foundation
MFC after: 12 days (together with r345661)


# 09c78d53 28-Mar-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Factor out retrieving the interpreter path from the main ELF
loader routine.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19715


# 20e1174a 26-Mar-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Factor out resource limit enforcement code in the ELF loader.
It makes the code slightly easier to follow, and might make
it easier to fix the resouce accounting to also account for
the interpreter.

The PROC_UNLOCK() is moved earlier - I don't see anything
it should protect; the lim_max() is a wrapper around lim_rlimit(),
and that, differently from lim_rlimit_proc(), doesn't require
the proc lock to be held.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19689


# 545517f1 23-Mar-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove trunc_page_ps() and round_page_ps() macros. This completes
the undoing of r100384.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19680


# 1699546d 01-Mar-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove sv_pagesize, originally introduced with r100384.

In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.

Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280


# fa50a355 10-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Implement Address Space Layout Randomization (ASLR)

With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603


# eb785fab 06-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Port sysctl kern.elf32.read_exec from amd64 to i386.

Make it more comprehensive on i386, by not setting nx bit for any
mapping, not just adding PF_X to all kernel-loaded ELF segments. This
is needed for the compatibility with older i386 programs that assume
that read access implies exec, e.g. old X servers with hand-rolled
module loader.

Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# b0b246b0 08-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove proctree acquire from note_procstat_proc

It is not needed since r340482 ("proc: always store parent pid in p_oppid")

Sponsored by: The FreeBSD Foundation


# cefb93f2 23-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

Parse FreeBSD Feature Control note on the ELF image activation.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 92328a32 23-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

Generalize ELF parse_notes().

Remove the knowledge of the ABI note type and brandnote from it,
instead provide it with a callback to do note-specific matching and
data fetching. Implement callback to match against ELF brand.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# eda8fe63 23-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

Trivial reduction of the code duplication, reuse the return FALSE code.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4bf4b0f1 07-Nov-2018 John Baldwin <jhb@FreeBSD.org>

Enable non-executable stacks by default on RISC-V.

Reviewed by: markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17878


# c9e562b1 11-Sep-2018 Gordon Tetlow <gordon@FreeBSD.org>

Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by: markj
Reported by: Thomas Barabosch, Fraunhofer FKIE
Approved by: re (implicit)
Approved by: so
Security: FreeBSD-SA-18:12.elf
Security: CVE-2018-6924
Sponsored by: The FreeBSD Foundation


# 455d3589 30-Jul-2018 David E. O'Brien <obrien@FreeBSD.org>

Correct copyright dates.


# 53e20b27 19-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

When reporting an error, print the errno value.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# d8b2f079 29-May-2018 Brooks Davis <brooks@FreeBSD.org>

Correct pointer subtraction in KASSERT().

The assertion would never fire without truly spectacular future
programming errors.

Reported by: Coverity
CID: 1391367, 1391368
Sponsored by: DARPA, AFRL


# 5f77b8a8 24-May-2018 Brooks Davis <brooks@FreeBSD.org>

Avoid two suword() calls per auxarg entry.

Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by: kib
Comments from: emaste, jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15485


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# a95659f7 13-Mar-2018 Ed Maste <emaste@FreeBSD.org>

Use C99 boolean type for translate_osrel

Migrate to modern types before creating MD Linuxolator bits for new
architectures.

Reviewed by: cem
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14676


# b7feabf9 13-Mar-2018 Ed Maste <emaste@FreeBSD.org>

Use C99 designated initializers for struct execsw

It it makes use slightly more clear and facilitates grepping.


# 5cc6d253 12-Mar-2018 Ed Maste <emaste@FreeBSD.org>

ANSIfy sys/kern/imgact_*


# d722231b 05-Feb-2018 John Baldwin <jhb@FreeBSD.org>

Always give ELF brands a chance to veto a match.

If a brand provides a header_supported hook, check it when trying to
find a brand based on a matching interpreter as well as in the final
loop for the fallback brand. Previously a brand might reject a binary
via a header_supported hook in one of the earlier loops, but still be
chosen by one of these later loops.

Reviewed by: kib
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D13945


# 78f57a9c 08-Jan-2018 Mark Johnston <markj@FreeBSD.org>

Generalize the gzio API.

We currently use a set of subroutines in kern_gzio.c to perform
compression of user and kernel core dumps. In the interest of adding
support for other compression algorithms (zstd) in this role without
complicating the API consumers, add a simple compressor API which can be
used to select an algorithm.

Also change the (non-default) GZIO kernel option to not enable
compressed user cores by default. It's not clear that such a default
would be desirable with support for multiple algorithms implemented,
and it's inconsistent in that it isn't applied to kernel dumps.

Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D13632


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 904d8c49 20-Oct-2017 Michal Meloun <mmel@FreeBSD.org>

Add AT_HWCAP2 ELF auxiliary vector.
- allocate value for new AT_HWCAP2 auxiliary vector on all platforms.
- expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly
same way as for AT_HWCAP.

MFC after: 1 month
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D12699


# c2f37b92 14-Sep-2017 John Baldwin <jhb@FreeBSD.org>

Add AT_HWCAP and AT_EHDRFLAGS on all platforms.

A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'. A
process ABI can set this field to point to a value holding a mask of
architecture-specific CPU feature flags. If an ABI does not wish to
supply AT_HWCAP to processes the field can be left as NULL.

The support code for AT_EHDRFLAGS was already present on all systems,
just the #define was not present. This is a step towards unifying the
AT_* constants across platforms.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12290


# 51645e83 29-Jun-2017 John Baldwin <jhb@FreeBSD.org>

Store a 32-bit PT_LWPINFO struct for 32-bit process core dumps.

Process core notes for a 32-bit process running on a 64-bit host need to
use 32-bit structures so that the note layout matches the layout of notes
of a core dump of a 32-bit process under a 32-bit kernel.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D11407


# 86be94fc 30-Mar-2017 Tycho Nightingale <tychon@FreeBSD.org>

Add support for capturing 'struct ptrace_lwpinfo' for signals
resulting in a process dumping core in the corefile.

Also extend procstat to view select members of 'struct ptrace_lwpinfo'
from the contents of the note.

Sponsored by: Dell EMC Isilon


# 3aeacc55 29-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

A followup to r315749, two more places where brand->interp_path was
accessed unconditionally.

Reported by: se
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 0fe98320 23-Mar-2017 Ed Schouten <ed@FreeBSD.org>

Don't require the presence of the compat_3_brand.

The existing ELF image activator requires the brandinfo to provide such
a string unconditionally, even if the executable format in question
doesn't use this type of branding. Skip matching when it's a null
pointer.

Reviewed by: kib
MFC after: 2 weeks


# 2274ab3d 22-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Update r315753 with the proper flag name.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 1438fe3c 22-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Add a flag BI_BRAND_ONLY_STATIC to specify that the brand only
matches static binaries.

Interpretation of the 'static' there is that the binary must not
specify an interpreter. In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.

This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.

Revert r315701.

Discussed with and tested by: ed (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 7aab7a80 22-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Adjust r314851 to not require every brand to specify interpreter path.

Reported and tested by: ed
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c547cbb4 18-Mar-2017 Alan Cox <alc@FreeBSD.org>

Avoid unnecessary calls to vm_map_protect() in elf_load_section().

Typically, when elf_load_section() unconditionally passed VM_PROT_ALL to
elf_map_insert(), it was needlessly enabling execute access on the
mapping, and it would later have to call vm_map_protect() to correct the
mapping's access rights. Now, instead, elf_load_section() always passes
its parameter "prot" to elf_map_insert(). So, elf_load_section() must
only call vm_map_protect() if it needs to remove the write access that
was temporarily granted to perform a copyout().

Reviewed by: kib
MFC after: 1 week


# 9bcf2f2d 12-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Accept linkers representation for ELF segments with zero on-disk length.

For such segments, GNU bfd linker writes knowingly incorrect value
into the the file offset field of the program header entry, with the
motivation that file should not be mapped for creation of this segment
at all.

Relax checks for the ELF structure validity when on-disk segment
length is zero, and explicitely set mapping length to zero for such
segments to avoid validating rounding arithmetic.

PR: 217610
Reported by: Robert Clausecker <fuz@fuz.su>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 973d67c4 12-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Style.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e383e820 11-Mar-2017 Alan Cox <alc@FreeBSD.org>

Simplify the control flow and tidy up a comment in map_insert.

In collaboration with: kib
MFC after: 1 week


# 15a9aedf 07-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

When selecting brand based on old Elf branding, prefer the brand which
interpreter exactly matches the one requested by the activated image.

This change applies r295277, which did the same for note branding, to
the old brand selection, with the same reasoning of fixing compat32
interpreter substitution.

PR: 211837
Reported by: kenji@kens.fm
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3d560b4b 07-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Require whole brand string matching for old Elf branding.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 0bbee4cd 07-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Consistently use vm_ooffset_t type for the vm object offset in
elf_load_section.

The values passed currently as vm_offset_t are phdr.p_offset, which
have the native Elf word size. Since elf_load_section interprets them
as the file offset, use vm object offset type.

Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# aaadc41f 06-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Instead of direct use of vm_map_insert(), call vm_map_fixed(MAP_CHECK_EXCL).

This KPI explicitely indicates the intent of creating the mapping at
the fixed address, and incorporates the map locking into the callee.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 28e8da65 05-Mar-2017 Alan Cox <alc@FreeBSD.org>

Style and punctuation fixes.

Reviewed by: kib
MFC after: 3 days


# fe0a8a39 02-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 55b985b4 01-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use vm_map_insert() instead of vm_map_find() in elf_map_insert().

Elf_map_insert() needs to create mapping at the known fixed address.
Usage of vm_map_find() assumes, on the other hand, that any suitable
address space range above or equal the specified hint, is acceptable.
Due to operating on the fresh or cleared address space, vm_map_find()
usually creates mapping starting exactly at hint.

Switch to vm_map_insert() use to clearly request fixed mapping from
the VM.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e3d8f8fe 01-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

When deallocating the vm object in elf_map_insert() due to
vm_map_insert() failure, drop the vnode lock around the call to
vm_object_deallocate().

Since the deallocated object is the vm object of the vnode, we might
get the vnode lock recursion there. In fact, it is almost impossible
to make vm_map_insert() failing there on stock kernel.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 885f13dc 07-Feb-2017 John Baldwin <jhb@FreeBSD.org>

Copy the e_machine and e_flags fields from the binary into an ELF core dump.

In the kernel, cache the machine and flags fields from ELF header to use in
the ELF header of a core dump. For gcore, the copy these fields over from
the ELF header in the binary.

This matters for platforms which encode ABI information in the flags field
(such as o32 vs n32 on MIPS).

Reviewed by: kib
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D9392


# 77ebe276 24-Jan-2017 Ed Maste <emaste@FreeBSD.org>

imgact_elf: refactor et_dyn_addr calculation

This simplifies the logic somewhat. It is extracted from the change in
review in D5603.

Differential Revision: https://reviews.freebsd.org/D9321


# c468ff88 20-Jan-2017 Andriy Gapon <avg@FreeBSD.org>

don't abort writing of a core dump after EFAULT

It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segements.
The skipped segment (or a portion of it) will be zero-filled.

While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.

Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6

After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)

Reviewed by: julian, kib
Obtained from: Panzura (older version of the change)
MFC after: 5 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9233


# 5420f76b 04-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 09c69701 30-Aug-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Back out misfired extra file in r305108.


# c9a124dc 30-Aug-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Refix operation on sparse CPU mappings as in r302372, temporarily broken
by r304716.

PR: kern/210106
MFC after: 2 days


# 1005d8af 20-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

imgact_elf: Rename the segment iterator to match reality

The each_writable_segment routine evaluates segments on a slightly little more
nuanced metric than simply "writable" or not. Rename the function to more
closely match its behavior (each_dumpable_segment).

Suggested by: jhb
Sponsored by: EMC / Isilon Storage Division


# f3325003 20-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

ANSI-fy imgact_elf.c

Sponsored by: EMC / Isilon Storage Division


# 07f825e8 20-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

Fix DEBUG build on 64-bit arch after r303099

Reported by: Larry Rosenman <ler at lerctr.org>


# c17b0bd2 20-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

Extend ELF coredump to support more than 65535 segments

The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments
(program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7,
depending on document version) prescribes a special first section header where
sh_info represents the real number of program headers.

Test code to follow, when it is ready.

Reference: http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf

Reviewed by: emaste, markj
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7255


# ccb83afd 18-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Include process IDs in core dumps.

When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs. However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO). Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.

Reviewed by: kib, emaste (older version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D7117


# c77547d2 14-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Include command line arguments in core dump process info.

Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D7116


# 1cbb879d 05-Jul-2016 Ed Maste <emaste@FreeBSD.org>

add description for debug.elf{32,64}_legacy_coredump sysctl

Approved by: re (kib)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# a66dc0c5 25-May-2016 Ian Lepore <ian@FreeBSD.org>

Include machine/acle-compat.h in cdefs.h on arm if the compiler doesn't
have ACLE support built in. The ACLE (ARM C Language Extensions) defines
a set of standardized symbols which indicate the architecture version and
features available. ACLE support is built in to modern compilers (both
clang and gcc), but absent from gcc prior to 4.4.

ARM (the company) provides the acle-compat.h header file to define the
right symbols for older versions of gcc. Basically, acle-compat.h does
for arm about the same thing cdefs.h does for freebsd: defines
standardized macros that work no matter which compiler you use. If ARM
hadn't provided this file we would have ended up with a big #ifdef __arm__
section in cdefs.h with our own compatibility shims.

Remove #include <machine/acle-compat.h> from the zillion other places (an
ever-growing list) that it appears. Since style(9) requires sys/types.h
or sys/param.h early in the include list, and both of those lead to
including cdefs.h, only a couple special cases still need to include
acle-compat.h directly.

Loves it: imp


# d9c9c81c 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: use our roundup2/rounddown2() macros when param.h is available.

rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.


# 35030a5d 29-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove some NULL checks for M_WAITOK allocations.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# af582aae 04-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

When matching brand to the ELF binary by notes, try to find a brand
with interpreter name exactly matching one wanted by the binary. If
no such brand exists, return first brand which accepted the binary by
note.

The change fixes a regression after r292749, where e.g. our two ia32
compat brands, ia32_brand_info and ia32_brand_oinfo, only differ by
the interpeter path and binary matches to a brand by linkage order.
Then old binaries which require /usr/libexec/ld-elf.so.1 but matched
against ia32_brand_info with interp_path /libexec/ld-elf.so.1, were
considered requiring non-standard interpreter name, and magic to force
ld-elf32.so.1 did not happen.

Note that it might make sense to apply the same selection of brands
for other matching criteria, SCO EI_OSABI and 3.x string.

Reported and tested by: dwmalone
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 18995077 26-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not substitute interpeter if the brand interpreter path is
different from the interpreter path requested by the binary.

Before this change, it is impossible to activate non-default
interpreter for 32bit image on amd64, when /libexec/ld-elf32.so.1 file
exists.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d3ee0a15 23-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Only allow one PT_INTERP ELF program header. This also fixes a potential
memory leak for interp_buf.

Differential Revision: https://reviews.freebsd.org/D4692
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Juniper Networks


# d943fa35 22-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

If we annoy user with the terminal output due to failed load of
interpreter, also show the actual error code instead of some
interpretation.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4c22b468 07-Dec-2015 Ed Maste <emaste@FreeBSD.org>

Replace magic value ELF note type with NT_FREEBSD_ABI_TAG

As of r291909 elf_common.h provides a definition.

Suggested by: kib
Sponsored by: The FreeBSD Foundation


# 4d22d07a 06-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Add support for usermode (vdso-like) gettimeofday(2) and
clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural
generic timer hardware. It is similar how the RDTSC timer is used in
userspace on x86.

Fix a permission problem where generic timer access from EL0 (or
userspace on v7) was not properly initialized on APs.

For ARMv7, mark the stack non-executable. The shared page is added for
all arms (including ARMv8 64bit), and the signal trampoline code is
moved to the page.

Reviewed by: andrew
Discussed with: emaste, mmel
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4209


# f19d421a 01-Dec-2015 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Missed header_supported call from r291020: make really, really sure the brand
likes the executable.


# 686d2f31 18-Nov-2015 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Extend r270123 to run the brand info's header_supported() routine for
branded as well as unbranded binaries. This will be required to add
support for the new ELFv2 ABI on powerpc64, which is distinguished from
ELFv1 by the contents of the ELF header's flags field.

Reviewed by: imp
MFC after: 2 weeks


# 9a12e282 01-Nov-2015 Enji Cooper <ngie@FreeBSD.org>

Define `compress` in `__elfN(coredump)` when #ifdef GZIO is true to mute
an -Wunused-but-set-variable warning

Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job
Sponsored by: EMC / Isilon Storage Division


# 6c775eb6 14-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Allow PT_INTERP and PT_NOTES segments to be located anywhere in the
executable image. Keep one page (arbitrary) limit on the max allowed
size of the PT_NOTES.

The ELF image activators still require that program headers of the
executable are fully contained in the first page of the image file.

Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D3871


# e6b95927 06-Oct-2015 Conrad Meyer <cem@FreeBSD.org>

Fix core corruption caused by race in note_procstat_vmmap

This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath. As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption
and truncation, even if names change, at the cost of up to PATH_MAX
bytes per mapped object. The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass. This
addresses corruption, at the cost of sometimes producing a truncated
result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
to grok the new zero padding.

Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3824


# bcb60d52 07-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

Follow-up to r287442: Move sysctl to compiled-once file

Avoid duplicate sysctl nodes.

Found by: tijl
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3586


# 14bdbaf2 03-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

Detect badly behaved coredump note helpers

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.

So:

- Detect badly behaved notes in putnote() and pad underfilled notes.

- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.

- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.

- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.

With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548


# 02d131ad 14-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Fix some error-handling bugs when core dump compression is enabled:
- Ensure that core dump parameters are initialized in the error path.
- Don't call gzio_fini() on a NULL stream.

Reported by: rpaulo


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 6b16d664 08-Jun-2015 Ed Maste <emaste@FreeBSD.org>

Add user facing errors for exceeding process memory limits

Previously the process terminating with SIGABRT at startup was the
only notification.

PR: 200617
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D2731


# ee960398 22-May-2015 Warner Losh <imp@FreeBSD.org>

Fix typo in symbol name. It helps to hit save in all your buffers
before committing.


# d36eec69 22-May-2015 Warner Losh <imp@FreeBSD.org>

Export the eflags field from the elf header. This allows better
discrimination between different subarch binaries, at least for mips
and arm. Arm is implemented, mips is still tbd, so not currently
exported. aarch64 does not export this because aarch64 binaries use
different tags and flags than arm.

Differential Revision: https://reviews.freebsd.org/D2611


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 316b3843 15-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

Implement support for binary to requesting specific stack size for the
initial thread. It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary' request.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# aa14e9b7 08-Mar-2015 Mark Johnston <markj@FreeBSD.org>

Reimplement support for userland core dump compression using a new interface
in kern_gzio.c. The old gzio interface was somewhat inflexible and has not
worked properly since r272535: currently, the gzio functions are called with
a range lock held on the output vnode, but kern_gzio.c does not pass the
IO_RANGELOCKED flag to vn_rdwr() calls, resulting in deadlock when vn_rdwr()
attempts to reacquire the range lock. Moreover, the new gzio interface can
be used to implement kernel core compression.

This change also modifies the kernel configuration options needed to enable
userland core dump compression support: gzio is now an option rather than a
device, and the COMPRESS_USER_CORES option is removed. Core dump compression
is enabled using the kern.compress_user_cores sysctl/tunable.

Differential Revision: https://reviews.freebsd.org/D1832
Reviewed by: rpaulo
Discussed with: kib


# b96bd95b 27-Feb-2015 Ian Lepore <ian@FreeBSD.org>

Allow the kern.osrelease and kern.osreldate sysctl values to be set in a
jail's creation parameters. This allows the kernel version to be reliably
spoofed within the jail whether examined directly with sysctl or
indirectly with the uname -r and -K options.

The values can only be set at jail creation time, to eliminate the need
for any locking when accessing the values via sysctl.

The overridden values are inherited by nested jails (unless the config for
the nested jails also overrides the values).

There is no sanity or range checking, other than disallowing an empty
release string or a zero release date, by design. The system
administrator is trusted to set sane values. Setting values that are
newer than the actual running kernel will likely cause compatibility
problems.

Differential Revision: https://reviews.freebsd.org/D1948
Relnotes: yes


# bc411bc2 14-Feb-2015 John Baldwin <jhb@FreeBSD.org>

Include OBJT_PHYS VM objects in ELF core dumps. In particular this
includes the shared page allowing debuggers to use the signal trampoline
code to identify signal frames in core dumps.

Differential Revision: https://reviews.freebsd.org/D1828
Reviewed by: alc, kib
MFC after: 1 week


# 64779280 22-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

The size value should be asserted when it is known.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation


# 180e57e5 21-Nov-2014 John Baldwin <jhb@FreeBSD.org>

Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).

Differential Revision: https://reviews.freebsd.org/D1193
Reviewed by: kib
MFC after: 2 weeks


# 539c9eef 04-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Fixes for i/o during coredumping:
- Do not dump into system files.
- Do not acquire write reference to the mount point where img.core is
written, in the coredump(). The vn_rdwr() calls from ELF imgact
request the write ref from vn_rdwr(). Recursive acqusition of the
write ref deadlocks with the unmount.
- Instead, take the range lock for the whole core file. This prevents
parallel dumping from two processes executing the same image,
converting the useless interleaved dump into sequential dumping,
with second core overwriting the first.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 6662ce5a 29-Aug-2014 Mateusz Guzik <mjg@FreeBSD.org>

Add missing proctree locking to fill_kinfo_proc consumers.

This fixes r270444.

Pointy hat: mjg
Reported by: many
MFC after: 1 week


# 817dc004 17-Aug-2014 Warner Losh <imp@FreeBSD.org>

Expand the elf brandelf infrastructure to give access to the whole ELF
header (Elf_Ehdr) to determine if a particular interpretor wants to
accept it or not. Use this mechanism to filter EABI arm on OABI arm
kernels, and vice versa. This method could also be used to implement
OABI on EABI arm kernels, if desired, or to allow a single mips kernel
to run o32, n32 and n64 binaries.

Differential Revision: https://reviews.freebsd.org/D609


# e7d939bd 06-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 2dedc128 16-Jun-2014 Dmitry Chagin <dchagin@FreeBSD.org>

Revert r266925 as it can lead to instant panic at fexecve():

To allow to run the interpreter itself add a new ELF branding type.

Pointed out by: kib, mjg


# 5f56da18 31-May-2014 Dmitry Chagin <dchagin@FreeBSD.org>

To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after: 3 days


# 83a396ce 14-Apr-2014 Christian Brueffer <brueffer@FreeBSD.org>

Refine r264422: set buf to NULL only when we don't allocate memory,
and free buf unconditionally.

Requested by: kib
MFC after: 1 week


# a1761d73 13-Apr-2014 Christian Brueffer <brueffer@FreeBSD.org>

Free buf after usage.

CID: 1199377
Found with: Coverity Prevent(tm)
MFC after: 1 week


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# edb572a3 09-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by: alc
Approved by: re (kib)


# be996836 05-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by: EMC / Isilon storage division
Reported by: kib


# 3b6714ca 04-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho


# 1b8388cd 01-May-2013 Mikolaj Golub <trociny@FreeBSD.org>

Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declare our
intention to use 4-byte padding for elf notes.

MFC after: 3 weeks


# f1fca82e 16-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Add a new set of notes to a process core dump to store procstat data.

The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR: kern/173723
Suggested by: jhb
Discussed with: jhb, kib
Reviewed by: jhb (initial version), kib
MFC after: 1 month


# bd390213 14-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Re-factor coredump routines. For each type of notes an output
function is provided, which is used either to calculate the note size
or output it to sbuf. On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf. For the sbuf a drain routine is
provided that writes data to a core file.

The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer. Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.

Reviewed by: jhb (initial version), kib
Discussed with: jhb, kib
MFC after: 3 weeks


# bc403f03 08-Apr-2013 Attilio Rao <attilio@FreeBSD.org>

Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# fb5ea9d1 07-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Fill p_flags and p_align fields of the core dump note segement.

Reviewed by: kib
MFC after: 2 weeks


# 27b05648 07-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Use 4-byte padding for core dump notes on both 32 and 64bit archs.

Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.

This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.

Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks


# d19d5bf4 13-Mar-2013 Tijl Coosemans <tijl@FreeBSD.org>

- Fix two possible overflows when testing if ELF program headers are on
the first page:
1. Cast uint16_t operands in a multiplication to unsigned int because
otherwise the implicit promotion to int results in a signed
multiplication that can overflow and the behaviour on integer
overflow is undefined.
2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
page. There's no overflow here because size is already limited by
MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
This also fixes an off-by-one error.

Reviewed by: kib
MFC after: 1 week


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 2871baa4 10-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the ia64-specific code fragment, which effect is more cleanly
done by the call to trans_prot() function a line before.

Discussed with: Oliver Pinter <oliver.pntr@gmail.com>
MFC after: 1 week


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 877d24ac 28-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# d1ae5c83 19-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix several reads beyond the mapped first page of the binary in the
ELF parser. Specifically, do not allow note reader and interpreter
path comparision in the brandelf code to read past end of the page.
This may happen if specially crafter ELF image is activated.

Submitted by: Lukasz Wojcik <lukasz.wojcik zoho com>
MFC after: 3 days


# aea81038 22-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Implement mechanism to export some kernel timekeeping data to
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month


# 1a9c7dec 11-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by: kan
Tested by: andrew (arm), flo (sparc64)
MFC after: 3 weeks


# 62c625fd 30-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Finally, try to enable the nxstacks on amd64 and powerpc64 for both 64bit
and 32bit ABIs. Also try to enable nxstacks for PAE/i386 when supported,
and some variants of powerpc32.

MFC after: 2 months (if ever)


# 1dfab802 17-Jan-2012 Alan Cox <alc@FreeBSD.org>

Explain why it is safe to unlock the vnode.

Requested by: kib


# 292177e6 16-Jan-2012 Alan Cox <alc@FreeBSD.org>

Improve abstraction. Eliminate direct access by elf*_load_section()
to an OBJT_VNODE-specific field of the vm object. The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section(). Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.

In collaboration with: kib


# 9a14aa01 15-Jan-2012 Ulrich Spörlein <uqs@FreeBSD.org>

Convert files to UTF-8


# 126b36a2 14-Oct-2011 Konstantin Belousov <kib@FreeBSD.org>

Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by: marcel


# 676eda08 13-Oct-2011 Marcel Moolenaar <marcel@FreeBSD.org>

In elf32_trans_prot() and when compiling for amd64 or ia64, add
PROT_EXECUTE when PROT_READ is needed. By default i386 allows
execution when reading is allowed and JDK 1.4.x depends on that.


# afcc55f3 06-Jul-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 12bc222e 30-Jun-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)


# 1ba5ad42 05-Apr-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add accounting for most of the memory-related resources.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 08b163fa 02-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# 26d8f3e1 08-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the same expression to report stack protection mode for AT_STACKEXEC
as the expression used by exec_new_vmspace().


# 291c06a1 08-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

In elf image activator, read and apply the stack protection mode from
PT_GNU_STACK program header, if present and enabled. Two new sysctls
are provided, kern.elf32.nxstack and kern.elf64.nxstack, that allow to
enable PT_GNU_STACK for ABIs of specified bitsize, if ABI decided to
support shared page.

Inform rtld about access mode of the stack initial mapping by
AT_STACKPROT aux vector.

At the moment, the default is disabled, waiting for the usermode
support bits.


# ed167eaa 08-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Collect code to translate between vm_prot_t and p_flags into helper
functions.

MFC after: 1 week


# 7f08176e 22-Nov-2010 Attilio Rao <attilio@FreeBSD.org>

Add the ability for GDB to printout the thread name along with other
thread specific informations.

In order to do that, and in order to avoid KBI breakage with existing
infrastructure the following semantic is implemented:
- For live programs, a new member to the PT_LWPINFO is added (pl_tdname)
- For cores, a new ELF note is added (NT_THRMISC) that can be used for
storing thread specific, miscellaneous, informations. Right now it is
just popluated with a thread name.

GDB, then, retrieves the correct informations from the corefile via the
BFD interface, as it groks the ELF notes and create appropriate
pseudo-sections.

Sponsored by: Sandvine Incorporated
Tested by: gianni
Discussed with: dim, kan, kib
MFC after: 2 weeks


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# ee235bef 17-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by: marius (sparc64)
MFC after: 1 month


# fba6b1af 29-Apr-2010 Alfred Perlstein <alfred@FreeBSD.org>

Don't leak core_buf or gzfile if doing a compressed core file and we
hit an error condition.

Obtained from: Juniper Networks


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# a0ea661f 25-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Add the ELF relocation base to struct image_params. This will be
required to correctly relocate the executable entry point's function
descriptor on powerpc64.


# 920acedb 25-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Change the way text_addr and data_addr are computed to use the
executable status of segments instead of detecting the main text segment
by which segment contains the program entry point. This affects
obreak() and is required for correct operation of that function
on 64-bit PowerPC systems. The previous behavior was apparently
required only for the Alpha, which is no longer supported.

Reviewed by: jhb
Tested on: amd64, sparc64, powerpc


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 8b325009 04-Mar-2010 Alfred Perlstein <alfred@FreeBSD.org>

put calls to gzclose() under ifdef COMPRESS_USER_CORES to prevent
undefined symbols on kernels without this option.

Reported by: Alexander Best


# e7228204 01-Mar-2010 Alfred Perlstein <alfred@FreeBSD.org>

Merge projects/enhanced_coredumps (r204346) into HEAD:

Enhanced process coredump routines.

This brings in the following features:
1) Limit number of cores per process via the %I coredump formatter.
Example:
if corefilename is set to %N.%I.core AND num_cores = 3, then
if a process "rpd" cores, then the corefile will be named
"rpd.0.core", however if it cores again, then the kernel will
generate "rpd.1.core" until we hit the limit of "num_cores".

this is useful to get several corefiles, but also prevent filling
the machine with corefiles.

2) Encode machine hostname in core dump name via %H.

3) Compress coredumps, useful for embedded platforms with limited space.
A sysctl kern.compress_user_cores is made available if turned on.

To enable compressed coredumps, the following config options need to be set:
options COMPRESS_USER_CORES
device zlib # brings in the zlib requirements.
device gzio # brings in the kernel vnode gzip output module.

4) Eventhandlers are fired to indicate coredumps in progress.

5) The imgact sv_coredump routine has grown a flag to pass in more
state, currently this is used only for passing a flag down to compress
the coredump or not.

Note that the gzio facility can be used for generic output of gzip'd
streams via vnodes.

Obtained from: Juniper Networks
Reviewed by: kan


# 8cb7f89d 05-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r197726:

Print a warning in case we cannot add more brandinfo because
we would overflow the MAX_BRANDS sized array.

Reviewed by: kib


# dc68cec6 20-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197934:
Map PIE binaries at non-zero base address.

MFC r198202:
Honour non-zero mapbase for PIE binaries. Inform interpreter-less PIE
binary about its relocbase.

Approved by: re (kensmith)


# 5b15472f 20-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197932:
Do not map elf segments of zero length.

Approved by: re (kensmith)


# 7564c4ad 17-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

If ET_DYN binary has non-zero base address for some reason, honour it
and do not relocate the binary to ET_DYN_LOAD_ADDR. This allows for the
binary author to influence address map of the process. In particular,
when the binary is actually an interpeter, this allows to have almost
usual process address map.

Communicate the relocation bias of the mapping for interpeter-less
ET_DYN binary, that is interperter itself, in AT_BASE aux entry. This
way, rtld is able to find its dynamic structure and relocate itself.
Note that mapbase in the rtld is still wrong and requires further
fixing.

Reported and tested by: rwatson
Discussed with: kan
MFC after: 3 days


# ab02d85f 10-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Map PIE binaries at non-zero base address.

Discussed with: bz
Reviewed by: kan
Tested by: bz (i386, amd64), bsam (linux)
MFC after: some time


# 5b33842a 10-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not map segments of zero length.

Discussed with: bz
Reviewed by: kan
Tested by: bz (i386, amd64), bsam (linux)
MFC after: some time


# 925c8b5b 03-Oct-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Print a warning in case we cannot add more brandinfo because
we would overflow the MAX_BRANDS sized array.

Reviewed by: kib
MFC After: 1 month


# 914e5afe 02-Sep-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r196653:
Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.

Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.

The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo was matched first, as the fallback to ld-elf32.so.1 does
not exist in that case.

Reported and tested by: ticso
In collaboration with: kib
MFC after: 3 days
Approved by: re (rwatson)


# ecc2fda8 30-Aug-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.

Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.

The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo was matched first, as the fallback to ld-elf32.so.1 does
not exist in that case.

Reported and tested by: ticso
In collaboration with: kib
MFC after: 3 days


# ac63e409 27-Aug-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r196512:

Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.

Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.

Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].

In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].

These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR: kern/135468
Submitted by: dchagin [1] (initial patch)
Suggested by: kib [2]
Tested by: Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by: kib
Approved by: re (kensmith)


# 89ffc202 24-Aug-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.

Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.

Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].

In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].

These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR: kern/135468
Submitted by: dchagin [1] (initial patch)
Suggested by: kib [2]
Tested by: Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by: kib
MFC after: 3 days


# cd899aad 05-Apr-2009 Dmitry Chagin <dchagin@FreeBSD.org>

Fix KBI breakage by r190520 which affects older linux.ko binaries:

1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
modules won't have the flag set, so the new field brand_note would be
ignored.

Suggested by: jhb
Reviewed by: jhb
Approved by: kib (mentor)
MFC after: 6 days


# 267c52fc 22-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix several issues with parsing the notes for ELF objects.

Badly formed ELF note may cause the caclulated pointer to the next note
to point both after the note region, that was checked in the code, but
also to point before the region, that was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.

In the similar way, badly formed note may cause infinite loop by
pointing next note into the same or previous note. Guard against this by
limiting amount of loop iterations by arbitrary choosen big number.

For clarity, check the calculated note alignment in each iteration.

Reported by: Chris Palmer <chris noncombatant org> [1]
PR: kern/132886
Reviewed and tested by: dchagin
MFC after: 3 days


# 3ff06357 16-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by: pho
Reviewed by: kan


# 429f5a58 17-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Use the properly sized types for ELF object header and program headers.
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.

Reported by: swell.k gmail com
Reviewed by: dchagin
Tested by: dchagin, swell.k gmail com


# 32c01de2 13-Mar-2009 Dmitry Chagin <dchagin@FreeBSD.org>

Implement new way of branding ELF binaries by looking to a
".note.ABI-tag" section.

The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.

Move code which fetch osreldate for ELF binary to check_note() handler.

PR: 118473
Approved by: kib (mentor)


# 95c807cf 24-Jan-2009 Robert Watson <rwatson@FreeBSD.org>

When a statically linked binary is executed (or at least, one without
an interpreter definition in its program header), set the auxiliary
ELF argument AT_BASE to 0 rather than to the address that we would
have mapped the interpreter at if there had been one.

The ELF ABI specifications appear to be ambiguous as to the desired
behavior in this situation, as they define AT_BASE as the base address
of the interpreter, but do not mention what to do if there is none.
On Solaris, AT_BASE will be set to the base address of the static
binary if there is no interpreter, and on Linux, AT_BASE is set to 0.
We go with the Linux semantics as they are of more immediate utility
and allow the early runtime environment to know that the kernel has
not mapped an interpreter, but because AT_PHDR points at the ELF
header for the running binary, it is still possible to retrieve all
required mapping information when the process starts should it be
required. Either approach would be preferable to our current behavior
of passing a pointer to an unmapped region of user memory as AT_BASE.

MFC after: 3 weeks


# a3ac8c94 17-Dec-2008 Peter Wemm <peter@FreeBSD.org>

Remove sysctl debug.elf_trace and the trace field in auxargs. They go
nowhere. It used to be the equivalent of $LD_DEBUG in rtld-elf.
Elf_Auxargs is an internal structure.


# 35c2a5a8 17-Dec-2008 Warner Losh <imp@FreeBSD.org>

Minor style(9) nit.


# 6f347545 17-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Remove two remnant uses of AT_DEBUG.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 387ad998 08-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

If the ABI-overriden interpreter was not loaded, do not set
have_interp to TRUE. This allows the code in image activator to try
/libexec/ld-elf.so.1 as interpreter when newinterp is not found to
execute.

Reviewed by: peter
MFC after: 2 weeks (together with r175105)


# ccd3953e 14-May-2008 John Baldwin <jhb@FreeBSD.org>

Go back to using the process command name (p_comm) for the file name and
command line arguments stored in the note at the beginning of a core dump
instead of the current thread name.

Reviewed by: julian


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 4113f8d7 05-Jan-2008 Peter Wemm <peter@FreeBSD.org>

Fall back to the binary-specified interpreter (ld-elf.so.1) if the
ABI override binary isn't found. This could probably be smoother, but
it is what I did in p4 change #126891 on 2007/09/27. It should solve
the "ld-elf32.so.1"-in-chroot problem.


# f231de47 03-Dec-2007 Konstantin Belousov <kib@FreeBSD.org>

Implement fetching of the __FreeBSD_version from the ELF ABI-tag note.
The value is read into the p_osrel member of the struct proc. p_osrel
is set to 0 for the binaries without the note.

MFC after: 3 days


# 93d1c728 03-Dec-2007 Konstantin Belousov <kib@FreeBSD.org>

Check for the program headers alignment of the ELF images before
dereferencing. Unaligned access could cause panic on strict alignment
architectures.

Reviewed by: marcel, marius (also tested on sparc64, thanks !)
MFC after: 3 days


# e01eafef 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

A bunch more files that should probably print out a thread name
instead of a process name.


# 89b57fcf 05-Nov-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# 19059a13 14-May-2007 John Baldwin <jhb@FreeBSD.org>

Rework the support for ABIs to override resource limits (used by 32-bit
processes under 64-bit kernels). Previously, each 32-bit process overwrote
its resource limits at exec() time. The problem with this approach is that
the new limits affect all child processes of the 32-bit process, including
if the child process forks and execs a 64-bit process. To fix this, don't
ovewrite the resource limits during exec(). Instead, sv_fixlimits() is
now replaced with a different function sv_fixlimit() which asks the ABI to
sanitize a single resource limit. We then use this when querying and
setting resource limits. Thus, if a 32-bit process sets a limit, then
that new limit will be inherited by future children. However, if the
32-bit process doesn't change a limit, then a future 64-bit child will
see the "full" 64-bit limit rather than the 32-bit limit.

MFC is tentative since it will break the ABI of old linux.ko modules (no
other modules are affected).

MFC after: 1 week


# 4f506694 17-Jan-2007 Xin LI <delphij@FreeBSD.org>

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


# 976a87a2 19-Nov-2006 Alan Cox <alc@FreeBSD.org>

Add vm map and object locking to each_writable_segment().

Noticed by: jhb@
MFC after: 3 weeks


# e5e6093b 21-Jan-2006 Alan Cox <alc@FreeBSD.org>

Avoid a vm object reference leak in a rarely used code path.

An executable contains at most one PT_INTERP program header. Therefore,
the loop that searches for it can terminate after it is found rather than
iterating over the entire set of program headers.

Eliminate an unneeded initialization.

Reviewed by: tegge


# d49b2109 26-Dec-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Fix breakage introduced in the previous commit.


# 900b28f9 26-Dec-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Remove kern.elf32.can_exec_dyn sysctl. Instead extend Brandinfo structure
with flags bitfield and set BI_CAN_EXEC_DYN flag for all brands that usually
allow executing elf dynamic binaries (aka shared libraries). When it is
requested to execute ET_DYN elf image check if this flag is on after we
know the elf brand allowing execution if so.

PR: kern/87615
Submitted by: Marcin Koziej <creep@desk.pl>


# 60bb3943 23-Dec-2005 Alan Cox <alc@FreeBSD.org>

Maintain the lock on the vnode for most of exec_elfN_imgact().
Specifically, it is required for the I/O that may be performed by
elfN_load_section().

Avoid an obscure deadlock in the a.out, elf, and gzip image
activators. Add a comment describing why the deadlock does not occur
in the common case and how it might occur in less usual circumstances.

Eliminate an unused variable from exec_aout_imgact().

In collaboration with: tegge


# 373d1a3f 21-Dec-2005 Alan Cox <alc@FreeBSD.org>

Maintain the vnode lock throughout elfN_load_file() rather than releasing
it and reacquiring it in vrele(). Consequently, there is no reason to
increase the reference count on the vm object caching the file's pages.
Reviewed by: tegge

Eliminate unused parameters to elfN_load_file().


# ff6f03c7 20-Dec-2005 Alan Cox <alc@FreeBSD.org>

Eliminate an unneeded (vm_prot_t) parameter from two functions. Eliminate
unnecessary uses of a local variable.

Reviewed by: tegge


# 044bbbb5 17-Dec-2005 Alan Cox <alc@FreeBSD.org>

Correct a long-standing problem in elfN_map_insert(): In order to copy a
page to user space, the user space mapping must allow write access.

In collaboration with: tegge@
MFC after: 3 weeks


# 584716b0 16-Dec-2005 Alan Cox <alc@FreeBSD.org>

Style: The second argument to vm_map_find() should be NULL instead of 0.


# da61b9a6 16-Dec-2005 Alan Cox <alc@FreeBSD.org>

Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space. There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail. When it fails, the nascent process is aborted
(SIGABRT). Whereas, this reimplementation using sf_buf_alloc()
sleeps. (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster. Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.

Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks


# 481a1fe1 14-Nov-2005 Olivier Houchard <cognet@FreeBSD.org>

Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can
execute a ET_DYN binary (shared object).
This does not make much sense, but some linux scripts expect to be able to
execute /lib/ld-linux.so.2 (ldd comes to mind).
The sysctl defaults to 0.

MFC after: 3 days


# 5f419982 28-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57,
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:

Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer leads to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.

This aligns this code more with the eventual locking of ttys.

Suggested by: bde


# 84d2b7df 19-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions. In most cases, this means adding an
acquisition of Giant immediately around the function. In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset. This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after: 1 week


# 68ff2a43 15-Sep-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Improve the MP safeness associated with the creation of symbolic
links and the execution of ELF binaries. Two problems were found:

1) The link path wasn't tagged as being MP safe and thus was not properly
protected.
2) The ELF interpreter vnode wasnt being locked in namei(9) and thus was
insufficiently protected.

This commit makes the following changes:

-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduce a vfslocked variable which
will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
picked up.
-Drop in an assertion into vfs_lookup which ensures that if the MPSAFE
flag is NOT set, that we have picked up giant. If not panic (if WITNESS
compiled into the kernel). This should help us find conditions where vnode
operations are in-sufficiently protected.

This is a RELENG_6 candidate.

Discussed with: jeff
MFC after: 4 days


# 62919d78 30-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Jumbo-commit to enhance 32 bit application support on 64 bit kernels.
This is good enough to be able to run a RELENG_4 gdb binary against
a RELENG_4 application, along with various other tools (eg: 4.x gcore).
We use this at work.

ia32_reg.[ch]: handle the 32 bit register file format, used by ptrace,
procfs and core dumps.
procfs_*regs.c: vary the format of proc/XXX/*regs depending on the client
and target application.
procfs_map.c: Don't print a 64 bit value to 32 bit consumers, or their
sscanf fails. They expect an unsigned long.
imgact_elf.c: produce a valid 32 bit coredump for 32 bit apps.
sys_process.c: handle 32 bit consumers debugging 32 bit targets. Note
that 64 bit consumers can still debug 32 bit targets.

IA64 has got stubs for ia32_reg.c.

Known limitations: a 5.x/6.x gdb uses get/setcontext(), which isn't
implemented in the 32/64 wrapper yet. We also make a tiny patch to
gdb pacify it over conflicting formats of ld-elf.so.1.

Approved by: re


# 77f30fff 24-May-2005 Olivier Houchard <cognet@FreeBSD.org>

Don't set the default of kern.fallback_elf_brand to FreeBSD for arm, as
binutils now do the job for us


# e11a45c9 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Neither of our image formats require Giant now that the vm and vfs have
been locked.


# 9f65fb13 03-Apr-2005 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from elfN_load_section().


# 610ecfe0 29-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

o Split out kernel part of execve(2) syscall into two parts: one that
copies arguments into the kernel space and one that operates
completely in the kernel space;

o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.

Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# e0370a18 23-Sep-2004 Olivier Houchard <cognet@FreeBSD.org>

On arm, set the default elf brand to FreeBSD, until the binutils do it for us.


# 4da47b2f 10-Aug-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Add __elfN(dump_thread). This function is called from __elfN(coredump)
to allow dumping per-thread machine specific notes. On ia64 we use this
function to flush the dirty registers onto the backingstore before we
write out the PRSTATUS notes.

Tested on: alpha, amd64, i386, ia64 & sparc64
Not tested on: arm, powerpc


# cfaf7e60 08-Aug-2004 Doug Rabson <dfr@FreeBSD.org>

Make sure that AT_PHDR has a useful value even for static programs.


# 1f7a1baa 18-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

After maintaining previous behaviour in writing out the core notes, it's
time now to break with the past: do not write the PID in the first note.
Rationale:
1. [impact of the breakage] Process IDs in core files serve no immediate
purpose to the debugger itself. They are only useful to relate a core
file to a process. This can provide context to the person looking at
the core file, provided one keeps track of this. Overall, not having
the PID in the core file is only in very rare occasions unfortunate.
2. [reason of the breakage] Having one PRSTATUS note contain the PID,
while all others contain the LWPID of the corresponding kernel thread
creates an irregularity for the debugger that cannot easily be worked
around. This is caused by libthread_db correlating user thread IDs to
kernel thread (aka LWP) IDs and thus aware of the actual LWPIDs.

Update comments accordingly.


# 247aba24 26-Jun-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Allocate TIDs in thread_init() and deallocate them in thread_fini().
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.

This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
old and previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump it's state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.

Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.


# f99619a0 04-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Change the types of vn_rdwr_inchunks()'s len and aresid arguments to
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.


# 2b471bc6 04-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Back out workaround for vn_rdwr_inchunks()'s INT_MAX length limitation
after discussions with bde; vn_rdwr_inchunks() itself should be fixed.


# 16e6d162 04-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Write segments to core dump files in maximally-sized chunks that neither
exceed vn_rdwr_inchunks()'s INT_MAX length limitation nor span a block
boundary. This fixes dumping segments larger than 2GB.

PR: 67546


# 59c8bc40 22-Apr-2004 Alan Cox <alc@FreeBSD.org>

Utilize sf_buf_alloc() rather than pmap_qenter() (and sometimes
kmem_alloc_wait()) for mapping the image header. On all machines with a
direct virtual-to-physical mapping and SMP/HTT i386s, this is a clear win.


# ece267ba 08-Apr-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Do not assume that the initial thread (i.e. the thread with the ID
equal to the process ID) is still present when we dump a core. It
already may have been destroyed. In that case we would end up
dereferencing a NULL pointer, so specifically test for that as well.

Reported & tested by: Dan Nelson <dnelson@allantgroup.com>


# 8c9b7b2c 03-Apr-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Create NT_PRSTATUS and NT_FPREGSET notes for each and every thread
in the process. This is required for proper debugging of corefiles
created by 1:1 or M:N threaded processes. Add an XXX comment where
we should actually call a function that dumps MD specific notes.
An example of a MD specific note is the NT_PRXFPREG note for SSE
registers.

Since BFD creates non-annotated pseudo-sections for the first PRSTATUS
and FPREGSET notes (non-annotated in the sense that the name of the
section does not contain the pid/tid), make sure those sections describe
the initial thread of the process (i.e. the thread which tid equals the
pid). This is not strictly necessary, but makes sure that tools that use
the non-annotated section names will not change behaviour due to this
change.

The practical upshot of this all is that one can see the threads in
the debugger when looking at a corefile. For 1:1 threading this means
that *all* threads are visible.


# 3dc19c46 18-Mar-2004 Jacques Vidrine <nectar@FreeBSD.org>

Verify more bits of the ELF header: the program header table
entry size and the ELF version. Also, avoid a potential integer
overflow when determining whether the ELF header fits entirely
within the first page.

Reviewed by: jdp

A panic when attempting to execute an ELF binary with a bogus program
header table entry size was

Reported by: Christer Öberg <christer.oberg@texonet.com>


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 9b68618d 22-Dec-2003 Peter Wemm <peter@FreeBSD.org>

Add an additional field to the elf brandinfo structure to support
quicker exec-time replacement of the elf interpreter on an emulation
environment where an entire /compat/* tree isn't really warranted.


# c460ac3a 24-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# a063facb 31-May-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Fix ia32 compat on ia64. Recent ia64 MD changes caused the garbage on
the stack to be changed in a way incompatible with elf32_map_insert()
where we used data_buf without initializing it for when the partial
mapping resulting in a misaligned image (typical when the page size
implied by the image is not the same as the page size in use by the
kernel). Since data_buf is passed by reference to vm_map_find(), the
compiler cannot warn about it.

While here, move all local variables to the top of the function.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# e548a1d4 04-Jan-2003 Jake Burkholder <jake@FreeBSD.org>

- Provide backwards compatibility for kern.fallback_elf_brand.
- Use the generic elf type macros in imgact_elf.h instead of ifdefing the
entire contents of the header.


# a360a43d 04-Jan-2003 Jake Burkholder <jake@FreeBSD.org>

Improve the way that an elf image activator for an alternate word size is
included in the kernel. Include imgact_elf.c in conf/files, instead of
both imgact_elf32.c and imgact_elf64.c, which will use the default word
size for an architecture as defined in machine/elf.h. Architectures that
wish to build an additional image activator for an alternate word size can
include either imgact_elf32.c or imgact_elf64.c in files.${ARCH}, which
allows it to be dependent on MD options instead of solely on architecture.

Glanced at by: peter


# 551d79e1 20-Dec-2002 Marcel Moolenaar <marcel@FreeBSD.org>

Fix multiple registration of the elf_legacy_coredump sysctl variable.
The duplication is caused by the fact that imgact_elf.c is included
by both imgact_elf32.c and imgact_elf64.c and both are compiled by
default on ia64. Consequently, we have two seperate copies of the
elf_legacy_coredump variable due to them being declared static, and
two entries for the same sysctl in the linker set, both referencing
the unique copy of the elf_legacy_coredump variable. Since the second
sysctl cannot be registered, one of the elf_legacy_coredump variables
can not be tuned (if ordering still holds, it's the ELF64 related one).

The only solution is to create two different sysctl variables, just
like the elf<32|64>_trace sysctl variables. This unfortunately is an
(user) interface change, but unavoidable. Thus, on ELF32 platforms
the sysctl variable is called elf32_legacy_coredump and on ELF64
platforms it is called elf64_legacy_coredump. Platforms that have
both ELF formats have both sysctl variables.

These variables should probably be retired sooner rather than later.


# fa7dd9c5 16-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Change the way ELF coredumps are handled. Instead of unconditionally
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.

Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.

This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.

PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days


# 6d7bdc8d 08-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Assign value of NULL to imgp->execlabel when imgp is initialized
in the ELF code. Missed in earlier merge from the MAC tree.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 450ffb44 04-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Remove reference to struct execve_args from struct imgact, which
describes an image activation instance. Instead, make use of the
existing fname structure entry, and introduce two new entries,
userspace_argv, and userspace_envv. With the addition of
mac_execve(), this divorces the image structure from the specifics
of the execve() system call, removes a redundant pointer, etc.
No semantic change from current behavior, but it means that the
structure doesn't depend on syscalls.master-generated includes.

There seems to be some redundant initialization of imgact entries,
which I have maintained, but which could probably use some cleaning
up at some point.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 96725dd0 22-Oct-2002 Alexander Kabaev <kan@FreeBSD.org>

Handle binaries with arbitrary number PT_LOAD sections, not only
ones with one text and one data section.

The text and data rlimit checks still needs to be fixed to properly
accout for additional sections.

Reviewed by: peter (slightly different patch version)


# e80fb434 17-Oct-2002 Robert Drehmel <robert@FreeBSD.org>

Use strlcpy() instead of strncpy() to copy NUL terminated strings
for safety and consistency.


# 05ba50f5 21-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# d0ca7c29 07-Sep-2002 Peter Wemm <peter@FreeBSD.org>

Do not blow up when we walk off the end of the brands list.

Found by: kris, jake


# 21c2d047 03-Sep-2002 Matthew Dillon <dillon@FreeBSD.org>

Alright, fix the problems with the elf loader for the Alpha. It turns
out that there is no easy way to discern the difference between a text
segment and a data segment through the read-only OR execute attribute
in the elf segment header, so revert the algorithm to what it was before.

Neither can we account for multiple data load segments in the vmspace
structure (at least not without more work), due to assumptions obreak()
makes in regards to the data start and data size fields.

Retain RLIMIT_VMEM checking by using a local variable to track the
total bytes of data being loaded.

Reviewed by: peter
X-MFC after: ASAP


# 9782ecba 03-Sep-2002 Peter Wemm <peter@FreeBSD.org>

Make the text segment locating heuristics from rev 1.121 more reliable
so that it works on the Alpha. This defines the segment that the entry
point exists in as 'text' and any others (usually one) as data.

Submitted by: tmm
Tested on: i386, alpha


# 05ef8798 02-Sep-2002 Matthew Dillon <dillon@FreeBSD.org>

Grammer cleanup


# 5fe3ed62 01-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Moved elf brand identification into a function. Fully identify the
brand early in the process of loading an elf file, so that we can
identify the sysentvec, and so that we do not continue if we do not
have a brand (and thus a sysentvec). Use the values in the sysentvec
for the page size and vm ranges unconditionally, since they are all
filled in now.


# 8cf03452 01-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed more indentation bugs.


# cac45152 30-Aug-2002 Matthew Dillon <dillon@FreeBSD.org>

Implement data, text, and vmem limit checking in the elf loader and svr4
compat code. Clean up accounting for multiple segments. Part 1/2.

Submitted by: Andrey Alekseyev <uitm@zenon.net> (with some modifications)
MFC after: 3 days


# 81f223ca 25-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed most indentation bugs.


# ca0387ef 25-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed placement of operators. Wrapped long lines.


# fd559a8a 24-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed white space around operators, casts and reserved words.

Reviewed by: md5


# a7cddfed 24-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

return x; -> return (x);
return(x); -> return (x);

Reviewed by: md5


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 619eb6e5 13-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Hold the vnode lock throughout execve.
- Set VV_TEXT in the top level execve code.
- Fixup the image activators to deal with the newly locked vnode.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 3ebc1248 19-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from: dfr (mostly, many tweaks from me).


# 0b2ed1ae 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Clean up execve locking:

- Grab the vnode object early in exec when we still have the vnode lock.
- Cache the object in the image_params.
- Make use of the cached object in imgact_*.c


# 21dc7d4f 02-Jun-2002 Jens Schweikhardt <schweikh@FreeBSD.org>

Fix typo in the BSD copyright: s/withough/without/

Spotted and suggested by: des
MFC after: 3 weeks


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# bf43c504 16-Dec-2001 Mark Peek <mp@FreeBSD.org>

Remove whitespace at end of line.


# cbc89bfb 10-Oct-2001 Paul Saab <ps@FreeBSD.org>

Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by: peter
MFC after: 2 weeks


# 3418ebeb 26-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

Make uio_yield() a global. Call uio_yield() between chunks
in vn_rdwr_inchunks(), allowing other processes to gain an exclusive
lock on the vnode. Specifically: directory scanning, to avoid a race to the
root directory, and multiple child processes coring simultaniously so they
can figure out that some other core'ing child has an exclusive adv lock and
just exit instead.

This completely fixes performance problems when large programs core. You
can have hundreds of copies (forked children) of the same binary core all
at once and not notice.

MFC after: 3 days


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 06ae1e91 08-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

This brings in a Yahoo coredump patch from Paul, with additional mods by
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.

This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.

Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks


# ef4181d9 01-Sep-2001 Peter Wemm <peter@FreeBSD.org>

For ia64, set the default elf brand to be FreeBSD. This is temporarily
necessary only for as long as we're using a linux toolchain.


# 546a92c4 28-Aug-2001 Brian Somers <brian@FreeBSD.org>

OR M_WAITOK with M_ZERO in malloc()s args for clarity.


# 29b7fbd1 17-Aug-2001 Mark Peek <mp@FreeBSD.org>

Unbreak linux compatibility by providing the correct length of the buffer.

Reported by: "Pierre Y. Dampure" <pierre.dampure@westmarsh.com>,
"Niels Chr. Bank-Pedersen" <ncbp@bank-pedersen.dk>
Pointy hat to: mp


# a75a0c55 16-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Don't explicitly null-terminate. The buffer we are copying into is
already zeroed, and we explicitly leave the last byte untouched.

Submitted by: bde


# 911c2be0 16-Aug-2001 Mark Peek <mp@FreeBSD.org>

Reduce stack allocation (stack-fast?).
elf_load_file() => 352 to 52 bytes
exec_elf_imgact() => 1072 to 48 bytes
elf_corehdr() => 396 to 8 bytes

Reviewed by: julian


# 6eef6816 16-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Use explicit sizes for the prpsinfo command length string so that
we dont have any more unexpected changes in core dumps. This gets us
back to the original core dump layout from a few days ago.


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 613c83cb 23-May-2001 John Baldwin <jhb@FreeBSD.org>

Lock the VM while twiddling the vmspace.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 1005a129 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# f34fa851 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>


# 828c9e13 04-Mar-2001 David E. O'Brien <obrien@FreeBSD.org>

Do not set a default ELF syscall ABI fallback.
If one runs an un-branded Linux static binary that calls Linux's fcntl
the machine will reboot when interupted by the FreeBSD syscall ABI.


# 21a3ee0e 24-Feb-2001 David E. O'Brien <obrien@FreeBSD.org>

MFS: bring the consistent `compat_3_brand' support into -CURRENT
(the work was first done in the RELENG_4 branch near a release
during a MFC to make the code cleaner and more consistent)


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# ba88dfc7 26-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Back out proc locking to protect p_ucred for obtaining additional
references along with the actual obtaining of additional references.


# 611d9407 23-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Proc locking.


# c0c25570 12-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 553629eb 22-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 806d7daa 09-Nov-2000 Marcel Moolenaar <marcel@FreeBSD.org>

Make MINSIGSTKSZ machine dependent, and have the sigaltstack
syscall compare against a variable sv_minsigstksz in struct
sysentvec as to properly take the size of the machine- and
ABI dependent struct sigframe into account.

The SVR4 and iBCS2 modules continue to have a minsigstksz of
8192 to preserve behavior. The real values (if different) are
not known at this time. Other ABI modules use the real
values.

The native MINSIGSTKSZ is now defined as follows:

Arch MINSIGSTKSZ
---- -----------
alpha 4096
i386 2048
ia64 12288

Reviewed by: mjacob
Suggested by: bde


# 00910f28 05-Nov-2000 David E. O'Brien <obrien@FreeBSD.org>

ELF kernels should use an ELF sysvec. This allows us to move a.out
specific files to those platforms that acutally support a.out.


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 36240ea5 10-Sep-2000 Doug Rabson <dfr@FreeBSD.org>

Move the include of <sys/systm.h> so that KTR gets a declaration for
snprintf().


# 55af4c7d 23-Jul-2000 Brian Feldman <green@FreeBSD.org>

Using an atomic operation here won't help if nobody else uses them (for
this). Use the simple_lock() on v_interlock like elsewhere.


# 25ead034 23-Jul-2000 Brian Feldman <green@FreeBSD.org>

Solve the problem where it is possible to get the kernel stuck in
a loop down in pmap_init_pt(). A subtraction causes the number of
pages to become negative, that was assigned to an unsigned variable,
and there is a lot of iteration. The bug is due to the ELF image
activator not properly checking for its files being the correct size
as specified by the ELF header.

The solution is to check that the header doesn't ask for part of a
file when that part of the file doesn't exist. Make sure to set
VEXEC at the proper times to make the executables immutable (remove
race conditions). Also, the ELF format specifiies header entries
that allow embedding of other executables (hence how ld-elf.so.1
gets loaded, but not the same as loading shared libraries), so those
executables need to be set VEXEC, too, so they're immutable.

Reviewed by: peter


# 2c9b67a8 30-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


# c815a20c 17-Apr-2000 David E. O'Brien <obrien@FreeBSD.org>

Change our ELF binary branding to something more acceptable to the Binutils
maintainers.

After we established our branding method of writing upto 8 characters of
the OS name into the ELF header in the padding; the Binutils maintainers
and/or SCO (as USL) decided that instead the ELF header should grow two new
fields -- EI_OSABI and EI_ABIVERSION. Each of these are an 8-bit unsigned
integer. SCO has assigned official values for the EI_OSABI field. In
addition to this, the Binutils maintainers and NetBSD decided that a better
ELF branding method was to include ABI information in a ".note" ELF
section.

With this set of changes, we will now create ELF binaries branded using
both "official" methods. Due to the complexity of adding a section to a
binary, binaries branded with ``brandelf'' will only brand using the
EI_OSABI method. Also due to the complexity of pulling a section out of an
ELF file vs. poking around in the ELF header, our image activator only
looks at the EI_OSABI header field.

Note that a new kernel can still properly load old binaries except for
Linux static binaries branded in our old method.

*
* For a short period of time, ``ld'' will also brand ELF binaries
* using our old method. This is so people can still use kernel.old
* with a new world. This support will be removed before 5.0-RELEASE,
* and may not last anywhere upto the actual release. My expiration
* time for this is about 6mo.
*


# 77ac690c 27-Feb-2000 Paul Saab <ps@FreeBSD.org>

Update a comment in elf_coredump to reflect that if you madvise
with MADV_NOCORE, its address space is also excluded from a core
file.

Pointed out by: alc


# 9730a5da 27-Feb-2000 Paul Saab <ps@FreeBSD.org>

Add MAP_NOCORE to mmap(2), and MADV_NOCORE and MADV_CORE to madvise(2).
This
This feature allows you to specify if mmap'd data is included in
an application's corefile.

Change the type of eflags in struct vm_map_entry from u_char to
vm_eflags_t (an unsigned int).

Reviewed by: dillon,jdp,alfred
Approved by: jkh


# 654f6be1 27-Dec-1999 Bruce Evans <bde@FreeBSD.org>

Changed the type used to represent the user stack pointer from `long *'
to `register_t *'. This fixes bugs like misplacement of argc and argv
on the user stack on i386's with 64-bit longs. We still use longs to
represent "words" like argc and argv, and assume that they are on the
stack (and that there is stack). The suword() and fuword() families
should also use register_t.


# 762e6b85 15-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Introduce NDFREE (and remove VOP_ABORTOP)


# da654d90 20-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

s/p_cred->pc_ucred/p_ucred/g


# a3021f91 19-Nov-1999 Boris Popov <bp@FreeBSD.org>

Vnode was left referenced in the case if ELF image is broken.

Reviewed by: Peter Wemm <peter@netplex.com.au>


# 2e3c8fcb 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# fca666a1 31-Aug-1999 Julian Elischer <julian@FreeBSD.org>

General cleanup of core-dumping code.

Submitted by: Sean Fagan,


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# d44e4156 26-Aug-1999 Dima Ruban <dima@FreeBSD.org>

Don't follow symlinks on coredumps.

Reviewed by: dillon && security-officer


# bdbc8c26 09-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Fix the previous warning a different way since the emul_path exposure was
intentional. Avoid the warning by propagating the const filename through
to elf_load_file() instead.


# c6bb4a64 09-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Minor tweak - don't cause a warning.
I don't know if it was intentional or not, but it would have printed out:
/compat/linux/foo/bar.so: interpreter not found
If it was, then I've broken it. De-constifying the 'interp' variable
or carrying the constness through to elf_load_file() are alternatives.


# 7a583b02 05-Jul-1999 Marcel Moolenaar <marcel@FreeBSD.org>

Also try to load the interpreter without prepending "emul_path". This allows
dynamicly linked binaries to run in a chroot'd environment with "emul_path"
as the new root. The new behavior of loading interpreters is identical to the
principle of overlaying.

PR: 10145


# e972780a 16-May-1999 Alan Cox <alc@FreeBSD.org>

Add the options MAP_PREFAULT and MAP_PREFAULT_PARTIAL to vm_map_find/insert,
eliminating the need for the pmap_object_init_pt calls in imgact_* and
mmap.

Reviewed by: David Greenman <dg@root.com>


# e5f13bdd 14-May-1999 Alan Cox <alc@FreeBSD.org>

Simplify vm_map_find/insert's interface: remove the MAP_COPY_NEEDED option.

It never makes sense to specify MAP_COPY_NEEDED without also specifying
MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.

Reviewed by: David Greenman <dg@root.com>


# e37622b2 09-May-1999 Peter Wemm <peter@FreeBSD.org>

Fix a couple of warnings and some bitrot in comments.


# c33fe779 20-Feb-1999 John Polstra <jdp@FreeBSD.org>

If you merge this into -stable, please increment __FreeBSD_version
in "src/sys/sys/param.h".

Fix the ELF image activator so that it can handle dynamic linkers
which are executables linked at a fixed address. This improves
compliance with the ABI spec, and it opens the door to possibly
better dynamic linker performance in the future. I've experimented
a bit with a fixed-address dynamic linker, and it works fine. But
I don't have any measurements yet to determine whether it's
worthwhile.

Also, remove a few calculations that were never used for anything.

I will increment __FreeBSD_version, since this adds a new capability
to the kernel that the dynamic linker might some day rely upon.


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 47633640 07-Feb-1999 John Polstra <jdp@FreeBSD.org>

Change the load address of the ELF dynamic linker from "2L*MAXDSIZ"
to an architecture-specific value defined in <machine/elf.h>. This
solves problems on large-memory systems that have a high value for
MAXDSIZ.

The load address is controlled by a new macro ELF_RTLD_ADDR(vmspace).
On the i386 it is hard-wired to 0x08000000, which is the standard
SVR4 location for the dynamic linker.

On the Alpha, the dynamic linker is loaded MAXDSIZ bytes beyond
the start of the program's data segment. This is the same place
a userland mmap(0, ...) call would put it, so it ends up just below
all the shared libraries. The rationale behind the calculation is
that it allows room for the data segment to grow to its maximum
possible size.

These changes have been tested on the i386 for several months
without problems. They have been tested on the Alpha as well,
though not for nearly as long. I would like to merge the changes
into 3.1 within a week if no problems have surfaced as a result of
them.


# 9fdfe602 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 6f8126fa 05-Feb-1999 John Polstra <jdp@FreeBSD.org>

Correct an "&" operator which should have been "&&".

Submitted by: mjacob


# 3b351cc1 04-Feb-1999 Mark Newton <newton@FreeBSD.org>

Additional note on last rev: The rationale for this is to allow you
to run Solaris executables (or executables from any other ELF system)
directly off the CD-ROM without having to waste megabytes of disk
by copying them to another filesystem just to brand them.


# f8b3601e 04-Feb-1999 Mark Newton <newton@FreeBSD.org>

Created sysctl kern.fallback_elf_brand. Defaults to "none", which will
give the same behaviour produced before today. If sysadmin sets it
to a valid ELF brand, ELF image activator will attempt to run unbranded
ELF exectutables as if they were branded with that value.

Suggested by: Dima Ruban <dima@best.net>


# 096977fa 03-Feb-1999 Mark Newton <newton@FreeBSD.org>

Provide elf_brand_inuse() as a method an emulator can use to find out
whether it is currently in use (which is kinda useful when it's about
to unload itself: Lockups are never very much fun, are they?).


# 820ca326 29-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

*_execsw static structures cannot be const due to the way they interact
with EXEC_SET, DECLARE_MODULE, and module_register. Specifically,
module_register. We may eventually be able to make these const, but
not now.


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 88c5ea45 25-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 6626c604 18-Dec-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Luoqi Chen, Jordan Hubbard
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)

Code to allow Linux Threads to run under FreeBSD.

By default not enabled
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garret)
This is not yet a 'real' option but will be within some number of hours.


# 2127f260 04-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 52c24af7 18-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Some cleanups and optimizations:
- Use the system headers method for Elf32/Elf64 symbol compatability
- get rid of the UPRINTF debugging.
- check the ELF header for compatability much more completely
- optimize the section mapper. Use the same direct VM interfaces that
imgact_aout.c and kern_exec.c use.
- Check the return codes from the vm_* functions better. Some return
KERN_* results, not an errno.
- prefault the page tables to reduce startup faults on page tables like
a.out does.
- reset the segment protection to zero for each loop, otherwise each
segment could get progressively more privs. (eg: if the first was
read/write/execute, and the second was meant to be read/execute, the
bug would make the second r/w/x too. In practice this was not a
problem because executables are normally laid out with text first.)
- Don't impose arbitary limits. Use the limits on headers imposed by
the need to fit them into one page.
- Remove unused switch() cases now that the verbose debugging is gone.

I've been using an earlier version of this for a month or so.
This sped up ELF exec speed a bit for me but I found it hard to get
consistant benchmarks when I tested it last (a few weeks ago).
I'm still bothered by the page read out of order caused by the
transition from data to bss. This which requires either part filling the
transition page or clearing the remainder.


# aa855a59 15-Oct-1998 Peter Wemm <peter@FreeBSD.org>

*gulp*. Jordan specifically OK'ed this..

This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a seperate commit as samples.
This all still works as a static a.out kernel using LKM's.


# 216a0f2d 15-Oct-1998 Doug Rabson <dfr@FreeBSD.org>

Don't frob the user stack directly, use suword instead. This fixes the
elf_freebsd_fixup() panic which many people have noticed on the alpha.


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# d1dbc694 11-Oct-1998 John Polstra <jdp@FreeBSD.org>

If an ELF executable has a recognized brand, then believe it.
Formerly, the heuristic involving the interpreter path took
precedence.

Also, print a better error message if the brand is missing or not
recognized. If there is no brand at all, give the user a hint that
"brandelf" needs to be run.


# 7b4c881c 02-Oct-1998 John Polstra <jdp@FreeBSD.org>

Fix a bug which caused the dynamic linker pathname in the PT_INTERP
program header entry to be ignored if a recognized brand was found.


# 0ff27d31 15-Sep-1998 John Polstra <jdp@FreeBSD.org>

Restore the core-dumping of all writable segments for ELF executables,
minus the NULL pointer dereference in rev. 1.33. Also simplify
things somewhat by eliminating one traversal of the VM map entries.
Finally, eliminate calls to vm_map_{un,}lock_read() which aren't
needed here. I originally took them from procfs_map.c, but here
we know we are dealing only with the map of the current process.


# dada0278 15-Sep-1998 John Polstra <jdp@FreeBSD.org>

Erk. Revert back to 1.31, dumping only data and stack to the core
file, until I can solve a panic that has just cropped up.


# 6bb20c50 15-Sep-1998 John Polstra <jdp@FreeBSD.org>

When choosing segments to write to the core file, don't assume that
writable implies readable.


# 8162da63 15-Sep-1998 John Polstra <jdp@FreeBSD.org>

Instead of just the data and stack segments, include all writable
segments (except memory-mapped devices) in the ELF core file. This
is really nice. You get access to the data areas of all shared
libraries, and even to files that are mapped read-write.

In the future, it might be good to add a new resource limit in the
spirit of RLIMIT_CORE. It would specify the maximum sized writable
segment to include in core dumps. Segments larger than that would
be omitted. This would be useful for programs that map very large
files read/write but that still would like to get usable core dumps.


# 8c64af4f 14-Sep-1998 John Polstra <jdp@FreeBSD.org>

Viola! The kernel now generates standard ELF core dumps for ELF
executables.

Currently only data and stack are included in the core dumps. I am
looking into adding the other (mmapped) writable segments as well.


# 22d4b0fb 13-Sep-1998 John Polstra <jdp@FreeBSD.org>

Add provisions for variant core dump file formats, depending on the
object format of the executable being dumped. This is the first
step toward producing ELF core dumps in the proper format. I will
commit the code to generate the ELF core dumps Real Soon Now. In
the meantime, ELF executables won't dump core at all. That is
probably no less useful than dumping a.out-style core dumps as they
have done until now.

Submitted by: Alex <garbanzo@hooked.net> (with very minor changes by me)


# a9d81f7c 29-Jul-1998 Doug Rabson <dfr@FreeBSD.org>

Default to FreeBSD if no brand detected. This makes life easier when
bootstrapping from NetBSD/alpha.


# 7cd99438 14-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast u_longs to uintptr_t before casting them to pointers. Don't
attempt to even partially support systems with function pointers
larger than object pointers.


# ed62fb52 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# 2e91d07a 08-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

Fix a typo which prevented i386 elf from working at all (including Linux
emulated elf binaries).


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 288078be 28-Apr-1998 Eivind Eklund <eivind@FreeBSD.org>

Translate T_PROTFLT to SIGSEGV instead of SIGBUS when running under
Linux emulation. This make Allegro Common Lisp 4.3 work under
FreeBSD!

Submitted by: Fred Gilham <gilham@csl.sri.com>
Commented on by: bde, dg, msmith, tg
Hoping he got everything right: eivind


# 3c1300a6 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# c8a79999 01-Mar-1998 Peter Wemm <peter@FreeBSD.org>

Update the ELF image activator to use some of the exec resources rather
than rolling it's own. This means that it now uses the "safe"
exec_map_first_page() to get the ld.so headers rather than risking a panic
on a page fault failure (eg: NFS server goes down).
Since all the ELF tools go to a lot of trouble to make sure everything
lives in the first page for executables, this is a win. I have not seen
any ELF executable on any system where all the headers didn't fit in the
first page with lots of room to spare.
I have been running variations of this code for some time on my pure ELF
systems.


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 1560a9d5 20-Sep-1997 Peter Wemm <peter@FreeBSD.org>

We were (I think) missing a vrele() on the vnode for the object loaded
via PT_INTERP (usually /usr/libexec/ld-elf.so.1).


# 5856e12e 12-Apr-1997 John Dyson <dyson@FreeBSD.org>

Fully implement vfork. Vfork is now much much faster than even our
fork. (On my machine, fork is about 240usecs, vfork is 78usecs.)

Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory
from the other threads of a group.

Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating
possible existing shares with other threads/processes.

Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a
thread from the rest of the group.

Fix the case where a thread does an exec. It is almost nonsense for a thread
to modify the other threads address space by an exec, so we
now automatically divorce the address space before modifying it.


# d8a4f230 01-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Use OID_AUTO instead of magic number for old sysctl debug.elf_trace. The
magic number conflicted with the one for the Lite2 sysctl debug.busyprt.

Staticized some variables.

Removed unused #includes.


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# e9822d92 22-Dec-1996 Joerg Wunsch <joerg@FreeBSD.org>

Make DFLDSIZ and MAXDSIZ fully-supported options.

"Don't forget to do a ``make depend''" :-)


# d672246b 24-Oct-1996 Søren Schmidt <sos@FreeBSD.org>

Added a missing break, so all static bins would be missed :(


# 717fb679 16-Oct-1996 Søren Schmidt <sos@FreeBSD.org>

Oops forgot to remove a debug printf.


# ea5a2b2e 16-Oct-1996 Søren Schmidt <sos@FreeBSD.org>

Prepare kernel to take advantage of "branded" ELF binaries.


# 1a7eb2dc 03-Oct-1996 Peter Wemm <peter@FreeBSD.org>

Drop an unused param to unmap_pages().


# e0c95ed9 31-Aug-1996 Bruce Evans <bde@FreeBSD.org>

Fixed the easy cases of const poisoning in the kernel. Cosmetic.


# 6ead3edd 17-Jun-1996 John Dyson <dyson@FreeBSD.org>

Clean-up the new VM map procfs code, and also add support for executable
format file "etype". It contains a description of the binary type for
a process.


# c23670e2 11-Jun-1996 Gary Palmer <gpalmer@FreeBSD.org>

Clean up -Wunused warnings.

Reviewed by: bde


# a794e791 30-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Removed unnecessary #includes from <sys/imgact.h> so that it is
self-sufficient and added explicit #includes where required.


# 71d7d1b1 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Remove references to MAP_FILE.. That is now "default" and is only
a "#define MAP_FILE 0" that is still there for net-2 source compatability.


# 250c11f9 10-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Tweak the data/bss segment page count. The last version worked
with all the test cases I tried, I'm sure this is more correct.

Tweak some prototypes.


# 8191d577 10-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Fix some rounding problems.. In some (fairly rare) situtaions it mapped
one page too many, which caused obreak() to fail in vm_map_find() with
ENOMEM because of the conflicting page.


# e1743d02 10-Mar-1996 Søren Schmidt <sos@FreeBSD.org>

First attempt at FreeBSD & Linux ELF support.

Compile and link a new kernel, that will give native ELF support, and
provide the hooks for other ELF interpreters as well.

To make native ELF binaries use John Polstras elf-kit-1.0.1..
For the time being also use his ld-elf.so.1 and put it in
/usr/libexec.

The Linux emulator has been enhanced to also run ELF binaries, it
is however in its very first incarnation.
Just get some Linux ELF libs (Slackware-3.0) and put them in the
prober place (/compat/linux/...).
I've ben able to run all the Slackware-3.0 binaries I've tried
so far.
(No it won't run quake yet :)