History log of /freebsd-11-stable/sys/kern/kern_jail.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 369335 22-Feb-2021 jamie

MFC jail: Change both root and working directories in jail_attach(2)

jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.

Add a matching internal chdir operation to the jail's root. Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.

Reported by: mjg
Approved by: markj, kib

(cherry picked from commit d4380c0cdd0517dc038403dd5c99242ce78bdeb5)

Git Hash: 570121808a76b85b2709502fb15618dd1e5296f1
Git Author: jamie@FreeBSD.org


# 369313 19-Feb-2021 jamie

MFC jail: Handle a possible race between jail_remove(2) and fork(2)

jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state. Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns. If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by: trasz
Approved by: kib

(cherry picked from commit cc7b73065302005ebc4a19503188c8d6d5eb923d)

Git Hash: c837631bd47af73d03e3d8907f1e58b88403007c
Git Author: jamie@FreeBSD.org


# 360299 25-Apr-2020 kp

MFC r360068:

ethersubr: Make the mac address generation more robust

If we create two (vnet) jails and create a bridge interface in each we end up
with the same mac address on both bridge interfaces.
These very often conflicts, resulting in same mac address in both jails.

Mitigate this problem by including the jail name in the mac address.


# 359020 16-Mar-2020 bz

MFC r358992:

kern_jail: missing \0 termination check on osrelease parameter

If a user spplies a non-\0 terminated osrelease parameter reading it back
may disclose kernel memory.
This is a problem in case of nested jails (children.max > 0, which is not
the default). Otherwise root outside the jail has access to kernel memory
by other means and root inside a jail cannot create a child jail.

Add the proper \0 check at the end of a supplied osrelease parameter and
make sure any copies of the field will be \0-terminated.

Submitted by: Hans Christian Woithe (chwoithe yahoo.com)


# 339446 20-Oct-2018 jamie

MFC r339409, r339420:

Add a new jail permission, allow.read_msgbuf. When true, jailed processes
can see the dmesg buffer (this is the current behavior). When false (the
new default), dmesg will be unavailable to jailed users, whether root or
not.

The security.bsd.unprivileged_read_msgbuf sysctl still works as before,
controlling system-wide whether non-root users can see the buffer.

PR: 211580
Submitted by: bz


# 339411 17-Oct-2018 jamie

MFC r339211:

Fix the test prohibiting jails from sharing IP addresses.

It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address. This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.

VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.


# 336083 08-Jul-2018 kib

MFC r335980:
Silence warnings about unused variables when RACCT is defined but RCTL
is not.


# 335536 22-Jun-2018 avg

MFC r332816: call racct_proc_ucred_changed() under the proc lock


# 316943 14-Apr-2017 smh

MFC r303863:

Move IPv4 & IPv6 specific jail functions to netinet and netinet6 files.

Sponsored by: Multiplay


# 315456 17-Mar-2017 vangyzen

MFC r313821 r315277 r315286

Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel.

inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack, except for KTR messages.
KTR can correctly log the immediate integral values passed to it,
as well as constant strings, but not non-constant strings,
since they might change by the time ktrdump retrieves them.
Therefore, use hex notation in KTR messages.

Sponsored by: Dell EMC


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 301764 09-Jun-2016 jamie

Fix a vnode leak when giving a child jail a too-long path when
debug.disablefullpath=1.


# 301760 09-Jun-2016 jamie

Re-order some jail parameter reading to prevent a vnode leak.


# 301758 09-Jun-2016 jamie

Clean up some logic in jail error messages, replacing a missing test and
a redundant test with a single correct test.


# 301745 09-Jun-2016 jamie

Make sure the OSD methods for jail set and remove can't run concurrently,
by holding allprison_lock exclusively (even if only for a moment before
downgrading) on all paths that call PR_METHOD_REMOVE. Since they may run
on a downgraded lock, it's still possible for them to run concurrently
with PR_METHOD_GET, which will need to use the prison lock.


# 300983 30-May-2016 jamie

Mark jail(2), and the sysctls that it (and only it) uses as deprecated.
jail(8) has long used jail_set(2), and those sysctl only cause confusion.


# 298819 29-Apr-2016 pfg

sys/kern: spelling fixes in comments.

No functional change.


# 298683 27-Apr-2016 jamie

Delay revmoing the last jail reference in prison_proc_free, and instead
put it off into the pr_task. This is similar to prison_free, and in fact
uses the same task even though they do something slightly different.

This resolves a LOR between the process lock and allprison_lock, which
came about in r298565.

PR: 48471


# 298668 26-Apr-2016 jamie

Use crcopysafe in jail_attach.


# 298566 25-Apr-2016 jamie

Pass the current/new jail to PR_METHOD_CHECK, which pushes the call
until after the jail is found or created. This requires unlocking the
jail for the call and re-locking it afterward, but that works because
nothing in the jail has been changed yet, and other processes won't
change the important fields as long as allprison_lock remains held.

Keep better track of name vs namelc in kern_jail_set. Name should
always be the hierarchical name (relative to the caller), and namelc
the last component.

PR: 48471
MFC after: 5 days


# 298565 25-Apr-2016 jamie

Add a new jail OSD method, PR_METHOD_REMOVE. It's called when a jail is
removed from the user perspective, i.e. when the last pr_uref goes away,
even though the jail mail still exist in the dying state. It will also
be called if either PR_METHOD_CREATE or PR_METHOD_SET fail.

PR: 48471
MFC after: 5 days


# 298564 25-Apr-2016 jamie

Remove the PR_REMOVE flag, which was meant as a temporary marker for
a jail that might be seen mid-removal. It hasn't been doing the right
thing since at least the ability to resurrect dying jails, and such
resurrection also makes it unnecessary.


# 298310 19-Apr-2016 pfg

kernel: use our nitems() macro when it is available through param.h.

No functional change, only trivial cases are done in this sweep,

Discussed in: freebsd-current


# 292277 15-Dec-2015 jamie

Fix jail name checking that disallowed anything that starts with '0'.
The intention was to just limit leading zeroes on numeric names. That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.

PR: 204897
MFC after: 3 days


# 290857 15-Nov-2015 trasz

Speed up rctl operation with large rulesets, by holding the lock
during iteration instead of relocking it for each traversed rule.

Reviewed by: mjg@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D4110


# 285685 19-Jul-2015 araujo

Add support to the jail framework to be able to mount linsysfs(5) and
linprocfs(5).

Differential Revision: D2846
Submitted by: Nikolai Lifanov <lifanov@mail.lifanov.com>
Reviewed by: jamie


# 285390 11-Jul-2015 mjg

Move chdir/chroot-related fdp manipulation to kern_descrip.c

Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by: kib


# 284514 17-Jun-2015 bz

Initialise pr_enforce_statfs from the "default" sysctl value and
not from the compile time constant. The sysctl value is seeded
from the compile time constant.

MFC after: 2 weeks


# 282213 29-Apr-2015 trasz

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 280445 24-Mar-2015 glebius

Do not include if_var.h and in6_var.h into kern_jail.c. It is now possible
after r280444.

Sponsored by: Nginx, Inc.


# 280130 15-Mar-2015 mjg

cred: add proc_set_cred helper

The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.


# 279396 28-Feb-2015 ian

Format the line properly (wrap before column 80).


# 279395 28-Feb-2015 ian

Export the new osreldate and osrelease jail parms in jail_get(2).


# 279361 27-Feb-2015 ian

Allow the kern.osrelease and kern.osreldate sysctl values to be set in a
jail's creation parameters. This allows the kernel version to be reliably
spoofed within the jail whether examined directly with sysctl or
indirectly with the uname -r and -K options.

The values can only be set at jail creation time, to eliminate the need
for any locking when accessing the values via sysctl.

The overridden values are inherited by nested jails (unless the config for
the nested jails also overrides the values).

There is no sanity or range checking, other than disallowing an empty
release string or a zero release date, by design. The system
administrator is trusted to set sane values. Setting values that are
newer than the actual running kernel will likely cause compatibility
problems.

Differential Revision: https://reviews.freebsd.org/D1948
Relnotes: yes


# 277855 28-Jan-2015 jamie

Add allow.mount.fdescfs jail flag.

PR: 192951
Submitted by: ruben@verweg.com
MFC after: 3 days


# 277159 14-Jan-2015 jamie

Remove the prison flags PR_IP4_DISABLE and PR_IP6_DISABLE, which have been
write-only for as long as they've existed.


# 277158 14-Jan-2015 jamie

Don't set prison's pr_ip4s or pr_ip6s to -1.

PR: 196474
MFC after: 3 days


# 271317 09-Sep-2014 trasz

Avoid unlocking unlocked mutex in RCTL jail code. Specific test case
is attached to PR.

PR: 193457
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 263152 14-Mar-2014 glebius

Remove AppleTalk support.

AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.


# 263140 14-Mar-2014 glebius

Remove IPX support.

IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.


# 261326 31-Jan-2014 jamie

Back out r261266 pending security buy-in.

r261266:
Add a jail parameter, allow.kmem, which lets jailed processes access
/dev/kmem and related devices (i.e. grants PRIV_IO and PRIV_KMEM_WRITE).
This in conjunction with changing the drm driver's permission check from
PRIV_DRIVER to PRIV_KMEM_WRITE will allow a jailed Xorg server.


# 261266 29-Jan-2014 jamie

Add a jail parameter, allow.kmem, which lets jailed processes access
/dev/kmem and related devices (i.e. grants PRIV_IO and PRIV_KMEM_WRITE).
This in conjunction with changing the drm driver's permission check from
PRIV_DRIVER to PRIV_KMEM_WRITE will allow a jailed Xorg server.

Submitted by: netchild
MFC after: 1 week


# 259520 17-Dec-2013 ae

Fix copy/paste typo.

MFC after: 1 week


# 258718 28-Nov-2013 peter

jail_v0.ip_number was always in host byte order. This was handled
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips(). When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.

This only matters for ancient binaries that call jail(2) themselves
internally.


# 257498 01-Nov-2013 glebius

prison_check_ip4() can take const arguments.


# 257176 26-Oct-2013 glebius

The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 255316 06-Sep-2013 jamie

Keep PRIV_KMEM_READ permitted inside jails as it is on the outside.


# 254741 23-Aug-2013 delphij

Allow tmpfs be mounted inside jail.


# 250804 19-May-2013 jamie

Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't
apply to most jails but do apply to vnet jails. This includes adding
a new sysctl "security.jail.vnet" to identify vnet jails.

PR: conf/149050
Submitted by: mdodd
MFC after: 3 days


# 244404 18-Dec-2012 mjg

prison_racct_detach can be called for not fully initialized jail, so make it check that the jail has racct before doing anything

PR: kern/174436
Reviewed by: trasz
MFC after: 3 days


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 235803 22-May-2012 trasz

Fix use-after-free in kern_jail_set() triggered e.g. by attempts
to clear "persist" flag from empty persistent jail, like this:

jail -c persist=1
jail -n 1 -m persist=0

Submitted by: Mateusz Guzik <mjguzik at gmail dot com>
MFC after: 2 weeks


# 235795 22-May-2012 trasz

Don't leak locks in prison_racct_modify().

Submitted by: Mateusz Guzik <mjguzik at gmail dot com>
MFC after: 2 weeks


# 232598 06-Mar-2012 trasz

Make racct and rctl correctly handle jail renaming. Previously
they would continue using old name, the one jail was created with.

PR: bin/165207


# 232278 28-Feb-2012 mm

Add procfs to jail-mountable filesystems.

Reviewed by: jamie
MFC after: 1 week


# 232186 26-Feb-2012 mm

Analogous to r232059, add a parameter for the ZFS file system:

allow.mount.zfs:
allow mounting the zfs filesystem inside a jail

This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers

MFC after: 10 days


# 232059 23-Feb-2012 mm

To improve control over the use of mount(8) inside a jail(8), introduce
a new jail parameter node with the following parameters:

allow.mount.devfs:
allow mounting the devfs filesystem inside a jail

allow.mount.nullfs:
allow mounting the nullfs filesystem inside a jail

Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.

Reviewed by: jamie
Suggested by: pjd
MFC after: 2 weeks


# 231267 09-Feb-2012 mm

Add support for mounting devfs inside jails.

A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for
mounting devfs inside jails. A value of -1 disables mounting devfs in
jails, a value of zero means no restrictions. Nested jails can only
have mounting devfs disabled or inherit parent's enforcement as jails are
not allowed to view or manipulate devfs(8) rules.

Utilizes new functions introduced in r231265.

Reviewed by: jamie
MFC after: 1 month


# 230407 20-Jan-2012 mm

Use separate buffer for global path to avoid overflow of path buffer.

Reviewed by: jamie@
MFC after: 3 weeks


# 230143 15-Jan-2012 mm

Fix missing in r230129:

kern_jail.c: initialize fullpath_disabled to zero
vfs_cache.c: add missing dot in comment

Reported by: kib
MFC after: 1 month


# 230129 15-Jan-2012 mm

Introduce vn_path_to_global_path()

This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by: kib
MFC after: 1 month


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 227293 07-Nov-2011 ed

Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.

This means that their use is restricted to a single C file.


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 225191 26-Aug-2011 jamie

Delay the recursive decrement of pr_uref when jails are made invisible
but not removed; decrement it instead when the child jail actually
goes away. This avoids letting the counter go below zero in the case
where dying (pr_uref==0) jails are "resurrected", and an associated
KASSERT panic.

Submitted by: Steven Hartland
Approved by: re (bz)
MFC after: 1 week


# 224615 02-Aug-2011 mm

Always disable mount and unmount for jails with enforce_statfs==2.
A working statfs(2) is required for umount(8) in jail.

Reviewed by: pjd, kib
Approved by: re (kib)
MFC after: 2 weeks


# 224290 24-Jul-2011 mckusick

This update changes the mnt_flag field in the mount structure from
32 bits to 64 bits and eliminates the unused mnt_xflag field. The
existing mnt_flag field is completely out of bits, so this update
gives us room to expand. Note that the f_flags field in the statfs
structure is already 64 bits, so the expanded mnt_flag field can
be exported without having to make any changes in the statfs structure.

Approved by: re (bz)


# 223735 03-Jul-2011 bz

Add infrastructure to allow all frames/packets received on an interface
to be assigned to a non-default FIB instance.

You may need to recompile world or ports due to the change of struct ifnet.

Submitted by: cjsp
Submitted by: Alexander V. Chernikov (melifaro ipfw.ru)
(original versions)
Reviewed by: julian
Reviewed by: Alexander V. Chernikov (melifaro ipfw.ru)
MFC after: 2 weeks
X-MFC: use spare in struct ifnet


# 221362 03-May-2011 trasz

Change the way rctl interfaces with jails by introducing prison_racct
structure, which acts as a proxy between them. This makes jail rules
persistent, i.e. they can be added before jail gets created, and they
don't disappear when the jail gets destroyed.


# 220163 30-Mar-2011 trasz

Add rctl. It's used by racct to take user-configurable actions based
on the set of rules it maintains and the current resource usage. It also
privides userland API to manage that ruleset.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 220137 29-Mar-2011 trasz

Add racct. It's an API to keep per-process, per-jail, per-loginclass
and per-loginclass resource accounting information, to be used by the new
resource limits code. It's connected to the build, but the code that
actually calls the new functions will come later.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 219304 05-Mar-2011 trasz

Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL. This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).

Reviewed by: kib (as part of a larger patch)


# 217896 26-Jan-2011 dchagin

Add macro to test the sv_flags of any process. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures.

MFC after: 1 month


# 216861 31-Dec-2010 bz

Mfp4 CH177924:

Add and export constants of array sizes of jail parameters as compiled into
the kernel.
This is the least intrusive way to allow kvm to read the (sparse) arrays
independent of the options the kernel was compiled with.

Reviewed by: jhb (originally)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH


# 212436 10-Sep-2010 jamie

Don't exit kern_jail_set without freeing options when enforce_statfs
has an illegal value.

MFC after: 3 days


# 211085 08-Aug-2010 jamie

Back out r210974. Any convenience of not typing "persist" is outweighed
by the possibility of unintended partially-formed jails.


# 210974 06-Aug-2010 jamie

Implicitly make a new jail persistent if it's set not to attach.

MFC after: 3 days


# 208803 04-Jun-2010 cperciva

Declare ip6 as (struct in6_addr *) instead of (struct in_addr *). This is
a harmless bug since we never actually use ip6 as anything other than an
opaque pointer.

Found with: Coverty Prevent(tm)
CID: 4319
MFC after: 1 month


# 205014 11-Mar-2010 nwhitehorn

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 203052 26-Jan-2010 delphij

Revised revision 199201 (add interface description capability as inspired
by OpenBSD), based on comments from many, including rwatson, jhb, brooks
and others.

Sponsored by: iXsystems, Inc.
MFC after: 1 month


# 202468 17-Jan-2010 bz

Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.

This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primry jail IP they had been used for
years, as well as for people prefering to implement similar policies.

Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could as well, as by the
design of jails. [1]

Reviewed by: jamie, hrs (ipv6 part)
Pointed out by: hrs [1]
MFC After: 2 weeks
Asked for by: Jase Thew (bazerka beardz.net)


# 202123 11-Jan-2010 bz

Change DDB show prison:
- name some columns more closely to the user space variables,
as we do for host.* or allow.* (in the listing) already.
- print pr_childmax (children.max).
- prefix hex values with 0x.

MFC after: 3 weeks


# 202116 11-Jan-2010 bz

Adjust a comment to reflect reality, as we have proper source
address selection, even for IPv4, since r183571.

Pointed out by: Jase Thew (bazerka beardz.net)
MFC after: 3 days


# 201145 28-Dec-2009 antoine

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


# 200473 13-Dec-2009 bz

Throughout the network stack we have a few places of
if (jailed(cred))
left. If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that, as it is used
outside the network stack as well.

Discussed with: julian, zec, jamie, rwatson (back in Sept)
MFC after: 5 days


# 199231 12-Nov-2009 delphij

Revert revision 199201 for now as it has introduced a kernel vulnerability
and requires more polishing.


# 199201 11-Nov-2009 delphij

Add interface description capability as inspired by OpenBSD.

MFC after: 3 months


# 196970 08-Sep-2009 phk

Revert previous commit and add myself to the list of people who should
know better than to commit with a cat in the area.


# 196969 08-Sep-2009 phk

Add necessary include.


# 196835 04-Sep-2009 jamie

Allow a jail's name to be the same as its jid (which is the default if no
name is specified), but still disallow other numeric names.

Reviewed by: zec
Approved by: bz (mentor)
MFC after: 3 days


# 196592 27-Aug-2009 jamie

Fix a LOR between allprison_lock and vnode locks by releasing
allprison_lock before releasing a prison's root vnode.

PR: kern/138004
Reviewed by: kib
Approved by: bz (mentor)
MFC after: 3 days


# 196505 24-Aug-2009 zec

When "jail -c vnet" request fails, the current code actually creates and
leaves behind an orphaned vnet. This change ensures that such vnets get
released.

This change affects only options VIMAGE builds.

Submitted by: jamie
Discussed with: bz
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days


# 196176 13-Aug-2009 bz

Make it possible to change the vnet sysctl variables on jails
with their own virtual network stack. Jails only inheriting a
network stack cannot change anything that cannot be changed from
within a prison.

Reviewed by: rwatson, zec
Approved by: re (kib)


# 196135 12-Aug-2009 bz

Make the kernel compile without IP networking by moving
a variable under a proper #ifdef.

Approved by: re (rwatson)


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 196002 31-Jul-2009 jamie

Make the "enforce_statfs" default 2 (most restrictive) in jail_set(2),
instead of whatever the parent/system has (which is generally 0). This
mirrors the old-style default used for jail(2) in conjunction with the
security.jail.enforce_statfs sysctl.

Approved by: re (kib), bz (mentor)


# 195974 30-Jul-2009 jamie

Remove a LOR, where the the sleepable allprison_lock was being obtained
in prison_equal_ip4/6 while an inp mutex was held. Locking allprison_lock
can be avoided by making a restriction on the IP addresses associated with
jails:

Don't allow the "ip4" and "ip6" parameters to be changed after a jail is
created. Setting the "ip4.addr" and "ip6.addr" parameters is allowed,
but only if the jail was already created with either ip4/6=new or
ip4/6=disable. With this restriction, the prison flags in question
(PR_IP4_USER and PR_IP6_USER) become read-only and can be checked
without locking.

This also allows the simplification of a messy code path that was needed
to handle an existing prison gaining an IP address list.

PR: kern/136899
Reported by: Dirk Meyer
Approved by: re (kib), bz (mentor)


# 195945 29-Jul-2009 jamie

Don't allow mixing the "vnet" and "ip4/6" jail parameters, since vnet
jails have their own IP stack and don't have access to the parent IP
addresses anyway. Note that a virtual network stack forms a break
between prisons with regard to the list of allowed IP addresses.

Approved by: re (kib), bz (mentor)


# 195944 29-Jul-2009 jamie

Change the default value of the "ip4" and "ip6" jail parameters to
"disable", which only allows access to the parent/physical system's
IP addresses when specifically directed. Change the default value of
"host" to "new", and don't copy the parent host values, to insulate
jails from the parent hostname et al.

Approved by: re (kib), bz (mentor)


# 195870 25-Jul-2009 jamie

Some jail parameters (in particular, "ip4" and "ip6" for IP address
restrictions) were found to be inadequately described by a boolean.
Define a new parameter type with three values (disable, new, inherit)
to handle these and future cases.

Approved by: re (kib), bz (mentor)
Discussed with: rwatson


# 195741 17-Jul-2009 jamie

Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by: re (kib), bz (mentor)


# 194923 24-Jun-2009 jamie

Wrap a PR_VNET inside "#ifdef VIMAGE" since that the only place it applies.
bz wants the blame for this.

Noticed by: rwatson
Approved by: bz (mentor)


# 194915 24-Jun-2009 jamie

In case of prisons with their own network stack, permit
additional privileges as well as not restricting the type of
sockets a user can open.

Note: the VIMAGE/vnet fetaure of of jails is still considered
experimental and cannot guarantee that privileged users
can be kept imprisoned if enabled.

Reviewed by: rwatson
Approved by: bz (mentor)


# 194762 23-Jun-2009 jamie

Add a limit for child jails via the "children.cur" and "children.max"
parameters. This replaces the simple "allow.jails" permission.

Approved by: bz (mentor)


# 194251 15-Jun-2009 jamie

Manage vnets via the jail system. If a jail is given the boolean
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail. Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.

Reviewed by: zec, julian
Approved by: bz (mentor)


# 194118 13-Jun-2009 jamie

Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by: bz (mentor)


# 194090 12-Jun-2009 jamie

Add counterparts to getcredhostname:
getcreddomainname, getcredhostuuid, getcredhostid

Suggested by: rmacklem
Approved by: bz


# 193865 09-Jun-2009 jamie

Fix some overflow errors: a signed allocation and an insufficiant
array size.

Reported by: pho
Tested by: pho
Approved by: bz (mentor)


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 193066 29-May-2009 jamie

Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by: bz (mentor)


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 192644 23-May-2009 jamie

Delay an error message until the variable it uses gets initialized.

Found with: Coverity Prevent(tm)
CID: 4316
Reported by: trasz
Approved by: bz (mentor)


# 191915 08-May-2009 zec

Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by: bz
Approved by: julian (mentor)


# 191896 07-May-2009 jamie

Move the per-prison Linux MIB from a private one-off pointer to the new
OSD-based jail extensions. This allows the Linux MIB to accessed via
jail_set and jail_get, and serves as a demonstration of adding jail support
to a module.

Reviewed by: dchagin, kib
Approved by: bz (mentor)


# 191673 29-Apr-2009 jamie

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# 191671 29-Apr-2009 jamie

Some non-functional changes: whitespace, KASSERT strings, declaration order.

Approved by: bz (mentor)


# 190466 27-Mar-2009 jamie

Whitespace/spelling fixes in advance of upcoming functional changes.

Approved by: bz (mentor)


# 188146 05-Feb-2009 jamie

Don't allow creating a socket with a protocol family that the current
jail doesn't support. This involves a new function prison_check_af,
like prison_check_ip[46] but that checks only the family.

With this change, most of the errors generated by jailed sockets
shouldn't ever occur, at least until jails are changeable.

Approved by: bz (mentor)


# 188144 05-Feb-2009 jamie

Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by: bz (mentor)


# 187864 28-Jan-2009 ed

Mark most often used sysctl's as MPSAFE.

After running a `make buildkernel', I noticed most of the Giant locks in
sysctl are only caused by a very small amount of sysctl's:

- sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like
sysctl.oidfmt.

- kern.ident, kern.osrelease, kern.version, etc. These are just constant
strings.

- kern.arandom, used by the stack protector. It is already protected by
arc4_mtx.

I also saw the following sysctl's show up. Not as often as the ones
above, but still quite often:

- security.jail.jailed. Also mark security.jail.list as MPSAFE. They
don't need locking or already use allprison_lock.

- kern.devname, used by devname(3), ttyname(3), etc.

This seems to reduce Giant locking inside sysctl by ~75% in my primitive
test setup.


# 187684 25-Jan-2009 bz

For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by: jamie (as part of a larger patch)
MFC after: 1 week


# 186736 04-Jan-2009 bz

Back out r186615; the sanitizing of the pointers in the error case
is not needed and seems that it will not be needed either.

Pointy hat: mine, mine, mine and not pho's


# 186615 30-Dec-2008 pho

Added missing second part of cleaning j->ip[46] as requested by bz

Approved by: kib (mentor)
Pointy hat: pho


# 186606 30-Dec-2008 pho

Make sure that unused j->ip[46] are cleared

Reviewed by: bz
Approved by: kib (mentor)


# 185899 10-Dec-2008 bz

Correctly check the number of prison states to not access anything
outside the prison_states array.
When checking if there is a name configured for the prison, check the
first character to not be '\0' instead of checking if the char array
is present, which it always is. Note, that this is different for the
*jailname in the syscall.

Found with: Coverity Prevent(tm)
CID: 4156, 4155
MFC after: 4 weeks (just that I get the mail)


# 185441 29-Nov-2008 bz

Unbreak the no-networks (no INET/6) build that I broke with
the commit in r185435.

Pointyhat: no, but I could need a ski cap for the winter


# 185435 29-Nov-2008 bz

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 185404 28-Nov-2008 bz

With the permissions of phk@ change the license on kern_jail.c
to a 2 clause BSD license.


# 185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 180357 07-Jul-2008 bz

MFp4 144659:
Plug a memory leak with jail services.

PR: 125257
Submitted by: Mateusz Guzik <mjguzik gmail.com>
MFC after: 6 days


# 180291 05-Jul-2008 rwatson

Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables. Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout(). A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after: 3 weeks


# 179881 19-Jun-2008 delphij

Revert rev. 178124 as requested by kris@. Having jail id not being
reused too frequently is useful for script controlled environment.


# 178124 11-Apr-2008 delphij

Instead of rolling our own jail number allocation procedure, use
alloc_unr() to do it.

Submitted by: Ed Schouten <ed 80386 nl>
PR: kern/122270
MFC after: 1 month


# 177785 31-Mar-2008 kib

Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 175630 24-Jan-2008 bz

Replace the last susers calls in netinet6/ with privilege checks.

Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by: rwatson (older version, before addressing his comments)


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 09-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172860 21-Oct-2007 rwatson

Add PRIV_VFS_STAT privilege, which will allow overriding policy limits on
the right to stat() a file, such as in mac_bsdextended.

Obtained from: TrustedBSD Project
MFC after: 3 months


# 168699 13-Apr-2007 pjd

Fix jails and jail-friendly file systems handling:
- We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail.
- Move security checks to vfs_suser() and deny unmounting and updating
for jailed root from different jails, etc.

OK'ed by: rwatson


# 168591 10-Apr-2007 rwatson

Allow PRIV_NETINET_REUSEPORT in jail.


# 168489 08-Apr-2007 pjd

prison_free() can be called with a mutex held. This wasn't a problem until
I converted allprison_mtx mutex to allprison_lock sx lock. To fix this LOR,
move prison removal to prison_complete() entirely. To ensure that noone
will reference this prison before it's beeing removed from the list skip
prisons with 'pr_ref == 0' in prison_find() and assert that pr_ref has to
greater than 0 in prison_hold().

Reported by: kris
OK'ed by: rwatson


# 168487 08-Apr-2007 pjd

Only use prison mutex to protect the fields that need to be protected by it.


# 168483 08-Apr-2007 pjd

pr_list is protected by the allprison_lock.


# 168401 05-Apr-2007 pjd

Implement functionality I called 'jail services'.

It may be used for external modules to attach some data to jail's in-kernel
structure.

- Change allprison_mtx mutex to allprison_sx sx(9) lock.
We will need to call external functions while holding this lock, which may
want to allocate memory.
Make use of the fact that this is shared-exclusive lock and use shared
version when possible.
- Implement the following functions:
prison_service_register() - registers a service that wants to be noticed
when a jail is created and destroyed
prison_service_deregister() - deregisters service
prison_service_data_add() - adds service-specific data to the jail structure
prison_service_data_get() - takes service-specific data from the jail
structure
prison_service_data_del() - removes service-specific data from the jail
structure

Reviewed by: rwatson


# 168399 05-Apr-2007 pjd

Make prison_find() globally accessible.


# 168396 05-Apr-2007 pjd

Add security.jail.mount_allowed sysctl, which allows to mount and
unmount jail-friendly file systems from within a jail.
Precisely it grants PRIV_VFS_MOUNT, PRIV_VFS_UNMOUNT and
PRIV_VFS_MOUNT_NONUSER privileges for a jailed super-user.
It is turned off by default.

A jail-friendly file system is a file system which driver registers
itself with VFCF_JAIL flag via VFS_SET(9) API.
The lsvfs(1) command can be used to see which file systems are
jail-friendly ones.

There currently no jail-friendly file systems, ZFS will be the first one.
In the future we may consider marking file systems like nullfs as
jail-friendly.

Reviewed by: rwatson


# 167354 09-Mar-2007 pjd

Minor simplification.


# 167309 07-Mar-2007 pjd

White space nits.


# 167211 04-Mar-2007 rwatson

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 167152 01-Mar-2007 pjd

Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better
describe the privilege.

OK'ed by: rwatson


# 166838 19-Feb-2007 rwatson

Remove unused PRIV_IPC_EXEC. Renumbers System V IPC privilege.


# 166832 19-Feb-2007 rwatson

Rename three quota privileges from the UFS privilege namespace to the
VFS privilege namespace: exceedquota, getquota, and setquota. Leave
UFS-specific quota configuration privileges in the UFS name space.

This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.


# 166831 19-Feb-2007 rwatson

Limit quota privileges in jail to PRIV_UFS_GETQUOTA and
PRIV_UFS_SETQUOTA.


# 166827 19-Feb-2007 rwatson

For now, reflect practical reality that Audit system calls aren't
allowed in Jail: return a privilege error.


# 164032 06-Nov-2006 rwatson

Add a new priv(9) kernel interface for checking the availability of
privilege for threads and credentials. Unlike the existing suser(9)
interface, priv(9) exposes a named privilege identifier to the privilege
checking code, allowing more complex policies regarding the granting of
privilege to be expressed. Two interfaces are provided, replacing the
existing suser(9) interface:

suser(td) -> priv_check(td, priv)
suser_cred(cred, flags) -> priv_check_cred(cred, priv, flags)

A comprehensive list of currently available kernel privileges may be
found in priv.h. New privileges are easily added as required, but the
comments on adding privileges found in priv.h and priv(9) should be read
before doing so.

The new privilege interface exposed sufficient information to the
privilege checking routine that it will now be possible for jail to
determine whether a particular privilege is granted in the check routine,
rather than relying on hints from the calling context via the
SUSER_ALLOWJAIL flag. For now, the flag is maintained, but a new jail
check function, prison_priv_check(), is exposed from kern_jail.c and used
by the privilege check routine to determine if the privilege is permitted
in jail. As a result, a centralized list of privileges permitted in jail
is now present in kern_jail.c.

The MAC Framework is now also able to instrument privilege checks, both
to deny privileges otherwise granted (mac_priv_check()), and to grant
privileges otherwise denied (mac_priv_grant()), permitting MAC Policy
modules to implement privilege models, as well as control a much broader
range of system behavior in order to constrain processes running with
root privilege.

The suser() and suser_cred() functions remain implemented, now in terms
of priv_check() and the PRIV_ROOT privilege, for use during the transition
and possibly continuing use by third party kernel modules that have not
been updated. The PRIV_DRIVER privilege exists to allow device drivers to
check privilege without adopting a more specific privilege identifier.

This change does not modify the actual security policy, rather, it
modifies the interface for privilege checks so changes to the security
policy become more feasible.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162383 17-Sep-2006 rwatson

Declare security and security.bsd sysctl hierarchies in sysctl.h along
with other commonly used sysctl name spaces, rather than declaring them
all over the place.

MFC after: 1 month
Sponsored by: nCircle Network Security, Inc.


# 150652 27-Sep-2005 csjp

Push Giant down in jails. Pass the MPSAFE flag to NDINIT, and keep track
of whether or not Giant was picked up by the filesystem. Add VFS_LOCK_GIANT
macros around vrele as it's possible that this can call in the VOP_INACTIVE
filesystem specific code. Also while we are here, remove the Giant assertion.
from the sysctl handler, we do not actually require Giant here so we
shouldn't assert it. Doing so will just complicate things when Giant is removed
from the sysctl framework.


# 147559 23-Jun-2005 pjd

Actually only protect mount-point if security.jail.enforce_statfs is set to 2.
If we don't return statistics about requested file systems, system tools
may not work correctly or at all.

Approved by: re (scottl)


# 147185 09-Jun-2005 pjd

Rename sysctl security.jail.getfsstatroot_only to security.jail.enforce_statfs
and extend its functionality:

value policy
0 show all mount-points without any restrictions
1 show only mount-points below jail's chroot and show only part of the
mount-point's path (if jail's chroot directory is /jails/foo and
mount-point is /jails/foo/usr/home only /usr/home will be shown)
2 show only mount-point where jail's chroot directory is placed.

Default value is 2.

Discussed with: rwatson


# 144660 05-Apr-2005 jeff

- Use taskqueue_thread rather than taskqueue_swi since our task is going
to vrele, which may vop lock. This is not safe in a software interrupt
context.


# 144442 31-Mar-2005 jhb

Drop a bogus mp_fixme(). Adding a lock would do nothing to reduce userland
races regarding changing of jail-related sysctls.


# 141543 08-Feb-2005 cperciva

Add a new sysctl, "security.jail.chflags_allowed", which controls the
behaviour of chflags within a jail. If set to 0 (the default), then a
jailed root user is treated as an unprivileged user; if set to 1, then
a jailed root user is treated the same as an unjailed root user.

This is necessary to allow "make installworld" to work inside a jail,
since it attempts to manipulate the system immutable flag on certain
files.

Discussed with: csjp, rwatson
MFC after: 2 weeks


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 131177 27-Jun-2004 pjd

Add two missing includes and remove two uneeded.
This is quite serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.


# 129462 20-May-2004 pjd

Fix sysctl name: security.jail.getfsstate_getfsstatroot_only ->
security.jail.getfsstatroot_only.

Approved by: rwatson


# 128664 26-Apr-2004 bmilekic

Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR. The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely. This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800


# 127020 15-Mar-2004 pjd

Remove sysctl security.jail.list_allowed.
This functionality was a misfeature, sysctl was added and turned off by
default just to check if nobody complains.

Reviewed by: rwatson


# 126023 19-Feb-2004 nectar

Rework jail_attach(2) so that an already jailed process cannot hop
to another jail.

Submitted by: rwatson


# 126004 19-Feb-2004 pjd

Added sysctl security.jail.jailed.
It returns 1 is process is inside of jail and 0 if it is not.
Information if we are in jail or not is not a secret, there is plenty of
ways to discover it. Many people are using own hack to check this and
this will be a legal way from now on.

It will be great if our starting scripts will take advantage of this sysctl
to allow clean "boot" inside jail.

Approved by: rwatson, scottl (mentor)


# 125806 14-Feb-2004 rwatson

By default, don't allow processes in a jail to list the set of
jails in the system. Previous behavior (allowed) may be restored
by setting security.jail.list_allowed=1.


# 125805 14-Feb-2004 rwatson

Fix mismerge in last commit: check that cred->cr_prison is NULL
before dereferencing the prison pointer.


# 125804 14-Feb-2004 rwatson

By default, when a process in jail calls getfsstat(), only return the
data for the file system on which the jail's root vnode is located.
Previous behavior (show data for all mountpoints) can be restored
by setting security.jail.getfsstatroot_only to 0. Note: this also
has the effect of hiding other mounts inside a jail, such as /dev,
/tmp, and /proc, but errs on the side of leaking less information.


# 124882 23-Jan-2004 rwatson

Defer the vrele() on a jail's root vnode reference from prison_free()
to a new prison_complete() task run by a task queue. This removes
a requirement for grabbing Giant in crfree(). Embed the 'struct task'
in 'struct prison' so that we don't have to allocate memory from
prison_free() (which means we also defer the FREE()).

With this change, I believe grabbing Giant from crfree() can now be
removed, but need to check the uidinfo code paths.

To avoid header pollution, move the definition of 'struct task'
to _task.h, and recursively include from taskqueue.h and jail.h; much
preferably to all files including jail.h picking up a requirement to
include taskqueue.h.

Bumped into by: sam
Reviewed by: bde, tjr


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 114168 28-Apr-2003 mike

style(9)


# 113630 17-Apr-2003 jhb

- The prison mutex cannot possibly protect pointers to the prison it
protects, so don't bother locking it while we assign it to a ucred's
cr_prison.
- Fully construct the new credential for a process before assigning it to
p_ucred.


# 113275 09-Apr-2003 mike

o In struct prison, add an allprison linked list of prisons (protected
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.

Reviewed by: rwatson, tjr


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108128 20-Dec-2002 mux

Don't forget to destroy the mutex if an error occurs
in the jail() system call.

Submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>


# 107850 14-Dec-2002 alfred

remove syscallarg().

Suggested by: peter


# 105354 17-Oct-2002 robert

Use strlcpy() instead of strncpy() to copy NUL terminated strings
for safety and consistency.


# 99227 01-Jul-2002 iedowse

The jail syscall calls chroot, which is not mpsafe, so put back a
mtx_lock(&Giant) around that call.

Reviewed by: arr


# 98832 25-Jun-2002 arr

- Alleviate jail() from having the burden of acquiring Giant by simply
removing. We can do this since we no longer need Giant to safely
execute jail().

Reviewed by: rwatson, jhb


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 91391 27-Feb-2002 robert

Make getcredhostname() take a buffer and the buffer's size
as arguments. The correct hostname is copied into the buffer
while having the prison's lock acquired in a jailed process'
case.

Reviewed by: jhb, rwatson


# 91384 27-Feb-2002 robert

Add a function which returns the correct hostname for a given
credential.

Reviewed by: phk


# 89414 16-Jan-2002 arr

- Attempt to help declutter kern. sysctl by moving security out from
beneath it.

Reviewed by: rwatson


# 87716 12-Dec-2001 arr

- Move _jail sysctl node underneath _kern_security in order to standardize
where our security related sysctl tuneables are located. Also, this
will help if/when we move _security node out from under _kern as to help
make _kern less cluttered.

Approved by: rwatson
Review by: rwatson


# 87275 03-Dec-2001 rwatson

o Introduce pr_mtx into struct prison, providing protection for the
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.

Reviewed by: jhb


# 85844 01-Nov-2001 rwatson

o Move suser() calls in kern/ to using suser_xxx() with an explicit
credential selection, rather than reference via a thread or process
pointer. This is part of a gradual migration to suser() accepting
a struct ucred instead of a struct proc, simplifying the reference
and locking semantics of suser().

Obtained from: TrustedBSD Project


# 84828 11-Oct-2001 jhb

- Catch up to the new ucred API.
- Add proc locking to the jail() syscall. This mostly involved shuffling
a few things around so that blockable things like malloc and copyin
were performed before acquiring the lock and checking the existing
ucred and then updating the ucred as one "atomic" change under the proc
lock.


# 83989 26-Sep-2001 rwatson

o Initialize per-jail securelevel from global securelevel as part of
jail creation.

Obtained from: TrustedBSD Project


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82710 01-Sep-2001 dillon

Pushdown Giant for acct(), kqueue(), kevent(), execve(), fork(),
vfork(), rfork(), jail().


# 81114 03-Aug-2001 rwatson

Anton kindly pointed out (and fixed) a bug in the Jail handling of the
bind() call on IPv4 sockets:

Currently, if one tries to bind a socket using INADDR_LOOPBACK inside a
jail, it will fail because prison_ip() does not take this possibility
into account. On the other hand, when one tries to connect(), for
example, to localhost, prison_remote_ip() will silently convert
INADDR_LOOPBACK to the jail's IP address. Therefore, it is desirable to
make bind() to do this implicit conversion as well.

Apart from this, the patch also replaces 0x7f000001 in
prison_remote_ip() to a more correct INADDR_LOOPBACK.

This is a 4.4-RELEASE "during the freeze, thanks" MFC candidate.

Submitted by: Anton Berezin <tobez@FreeBSD.org>
Discussed with at some point: phk
MFC after: 3 days


# 72786 21-Feb-2001 rwatson

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 68024 30-Oct-2000 rwatson

o Deny access to System V IPC from within jail by default, as in the
current implementation, jail neither virtualizes the Sys V IPC namespace,
nor provides inter-jail protections on IPC objects.
o Support for System V IPC can be enabled by setting jail.sysvipc_allowed=1
using sysctl.
o This is not the "real fix" which involves virtualizing the System V
IPC namespace, but prevents processes within jail from influencing those
outside of jail when not approved by the administrator.

Reported by: Paulo Fragoso <paulo@nlink.com.br>


# 61235 04-Jun-2000 rwatson

o Modify jail to limit creation of sockets to UNIX domain sockets,
TCP/IP (v4) sockets, and routing sockets. Previously, interaction
with IPv6 was not well-defined, and might be inappropriate for some
environments. Similarly, sysctl MIB entries providing interface
information also give out only addresses from those protocol domains.

For the time being, this functionality is enabled by default, and
toggleable using the sysctl variable jail.socket_unixiproute_only.
In the future, protocol domains will be able to determine whether or
not they are ``jail aware''.

o Further limitations on process use of getpriority() and setpriority()
by jailed processes. Addresses problem described in kern/17878.

Reviewed by: phk, jmg


# 57163 12-Feb-2000 rwatson

Yet-another-update: rename ``kern.prison'' to a new sysctl root entry,
``jail'', and move the set_hostname_allowed sysctl there, as well as
fixing a bug in the sysctl that resulted in jails being over-limited
(preventing them from reading as well as writing the hostname). Also,
correct some formatting issues, courtesy bde :-).

Reviewed by: phk
Approved by: jkh


# 51398 19-Sep-1999 phk

Add a version number field to the jail(2) argument so that future changes
can be handled intelligently.


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 46197 30-Apr-1999 phk

Add beer-ware license and $Id$

Noticed by: dillon


# 46194 30-Apr-1999 phk

Make BOOTP to work again.

Submitted by: dillon
Reviewed by: phk


# 46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/