History log of /linux-master/include/linux/capability.h
Revision Date Author Comments
# f122a08b 28-Feb-2023 Linus Torvalds <torvalds@linux-foundation.org>

capability: just use a 'u64' instead of a 'u32[2]' array

Back in 2008 we extended the capability bits from 32 to 64, and we did
it by extending the single 32-bit capability word from one word to an
array of two words. It was then obfuscated by hiding the "2" behind two
macro expansions, with the reasoning being that maybe it gets extended
further some day.

That reasoning may have been valid at the time, but the last thing we
want to do is to extend the capability set any more. And the array of
values not only causes source code oddities (with loops to deal with
it), but also results in worse code generation. It's a lose-lose
situation.

So just change the 'u32[2]' into a 'u64' and be done with it.

We still have to deal with the fact that the user space interface is
designed around an array of these 32-bit values, but that was the case
before too, since the array layouts were different (ie user space
doesn't use an array of 32-bit values for individual capability masks,
but an array of 32-bit slices of multiple masks).

So that marshalling of data is actually simplified too, even if it does
remain somewhat obscure and odd.

This was all triggered by my reaction to the new "cap_isidentical()"
introduced recently. By just using a saner data structure, it went from

unsigned __capi;
CAP_FOR_EACH_U32(__capi) {
if (a.cap[__capi] != b.cap[__capi])
return false;
}
return true;

to just being

return a.val == b.val;

instead. Which is rather more obvious both to humans and to compilers.

Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4eecbae 25-Jan-2023 Mateusz Guzik <mjguzik@gmail.com>

capability: add cap_isidentical

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9452e93e 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port privilege checking helpers to mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 39f60c1c 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port xattr to mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 71bc356f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

commoncap: handle idmapped mounts

When interacting with user namespace and non-user namespace aware
filesystem capabilities the vfs will perform various security checks to
determine whether or not the filesystem capabilities can be used by the
caller, whether they need to be removed and so on. The main
infrastructure for this resides in the capability codepaths but they are
called through the LSM security infrastructure even though they are not
technically an LSM or optional. This extends the existing security hooks
security_inode_removexattr(), security_inode_killpriv(),
security_inode_getsecurity() to pass down the mount's user namespace and
makes them aware of idmapped mounts.

In order to actually get filesystem capabilities from disk the
capability infrastructure exposes the get_vfs_caps_from_disk() helper.
For user namespace aware filesystem capabilities a root uid is stored
alongside the capabilities.

In order to determine whether the caller can make use of the filesystem
capability or whether it needs to be ignored it is translated according
to the superblock's user namespace. If it can be translated to uid 0
according to that id mapping the caller can use the filesystem
capabilities stored on disk. If we are accessing the inode that holds
the filesystem capabilities through an idmapped mount we map the root
uid according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts: reading filesystem caps from disk
enforces that the root uid associated with the filesystem capability
must have a mapping in the superblock's user namespace and that the
caller is either in the same user namespace or is a descendant of the
superblock's user namespace. For filesystems that are mountable inside
user namespace the caller can just mount the filesystem and won't
usually need to idmap it. If they do want to idmap it they can create an
idmapped mount and mark it with a user namespace they created and which
is thus a descendant of s_user_ns. For filesystems that are not
mountable inside user namespaces the descendant rule is trivially true
because the s_user_ns will be the initial user namespace.

If the initial user namespace is passed nothing changes so non-idmapped
mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# e65ce2a5 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

acl: handle idmapped mounts

The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 0558c1bf 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

capability: handle idmapped mounts

In order to determine whether a caller holds privilege over a given
inode the capability framework exposes the two helpers
privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
verifies that the inode has a mapping in the caller's user namespace and
the latter additionally verifies that the caller has the requested
capability in their current user namespace.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped inodes. If the initial user namespace is passed all
operations are a nop so non-idmapped mounts will not see a change in
behavior.

Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 7c03e2cd 14-Dec-2020 Miklos Szeredi <mszeredi@redhat.com>

vfs: move cap_convert_nscap() call into vfs_setxattr()

cap_convert_nscap() does permission checking as well as conversion of the
xattr value conditionally based on fs's user-ns.

This is needed by overlayfs and probably other layered fs (ecryptfs) and is
what vfs_foo() is supposed to do anyway.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>


# 124ea650 18-Jul-2020 Adrian Reber <areber@redhat.com>

capabilities: Introduce CAP_CHECKPOINT_RESTORE

This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has
been asked numerous times if it is possible to checkpoint/restore a
process as non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID
of the restored process. This feature available via the clone3 system
call, or via /proc/sys/kernel/ns_last_pid is unfortunately guarded by
CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have
increased due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a
resource manager distributing jobs where users are always running as
non-root. There is a desire to provide a way to checkpoint and
restore long running jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
CRIU into a Java VM to decrease the startup time. These
checkpoint/restore applications are not meant to be running with
CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
Unfortunately, this helper delegation technique is impossible to use
with clone3, and is thus prone to races.
See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
This has been demonstrated to work with cycling rates of 100,000 PIDs/s
See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited
use in typical container environments (e.g., Kubernetes) as /proc is
typically protected with read-only layers (e.g., /proc/sys) for
hardening purposes. Read-only layers prevent additional /proc mounts
(due to proc's SB_I_USERNS_VISIBLE property), making the use of new
PID namespaces limited as certain applications need access to /proc
matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for
recovering files that are unreachable via the file system such as
deleted files, or memfd files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-2-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# a17b53c4 13-May-2020 Alexei Starovoitov <ast@kernel.org>

bpf, capability: Introduce CAP_BPF

Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.

The end result provides simple safety model for applications that use BPF:
- to load tracing program types
BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
use CAP_BPF and CAP_PERFMON
- to load networking program types
BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
use CAP_BPF and CAP_NET_ADMIN

There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
tracing mechanism, hence this helper needs additional CAP_PERFMON
if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.

CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
hence CAP_NET_ADMIN is required at detach time.

CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.

Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.

CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.

bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.

CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.

Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.

Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com


# 98073728 02-Apr-2020 Alexey Budankov <alexey.budankov@linux.intel.com>

capabilities: Introduce CAP_PERFMON to kernel and user space

Introduce the CAP_PERFMON capability designed to secure system
performance monitoring and observability operations so that CAP_PERFMON
can assist CAP_SYS_ADMIN capability in its governing role for
performance monitoring and observability subsystems.

CAP_PERFMON hardens system security and integrity during performance
monitoring and observability operations by decreasing attack surface that
is available to a CAP_SYS_ADMIN privileged process [2]. Providing the access
to system performance monitoring and observability operations under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials, excludes
chances to misuse the credentials and makes the operation more secure.

Thus, CAP_PERFMON implements the principle of least privilege for
performance monitoring and observability operations (POSIX IEEE 1003.1e:
2.2.2.39 principle of least privilege: A security design principle that
states that a process or program be granted only those privileges
(e.g., capabilities) necessary to accomplish its legitimate function,
and only for the time that such privileges are actually required)

CAP_PERFMON meets the demand to secure system performance monitoring and
observability operations for adoption in security sensitive, restricted,
multiuser production environments (e.g. HPC clusters, cloud and virtual compute
environments), where root or CAP_SYS_ADMIN credentials are not available to
mass users of a system, and securely unblocks applicability and scalability
of system performance monitoring and observability operations beyond root
and CAP_SYS_ADMIN use cases.

CAP_PERFMON takes over CAP_SYS_ADMIN credentials related to system performance
monitoring and observability operations and balances amount of CAP_SYS_ADMIN
credentials following the recommendations in the capabilities man page [1]
for CAP_SYS_ADMIN: "Note: this capability is overloaded; see Notes to kernel
developers, below." For backward compatibility reasons access to system
performance monitoring and observability subsystems of the kernel remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN capability
usage for secure system performance monitoring and observability operations
is discouraged with respect to the designed CAP_PERFMON capability.

Although the software running under CAP_PERFMON can not ensure avoidance
of related hardware issues, the software can still mitigate these issues
following the official hardware issues mitigation procedure [2]. The bugs
in the software itself can be fixed following the standard kernel development
process [3] to maintain and harden security of system performance monitoring
and observability operations.

[1] http://man7.org/linux/man-pages/man7/capabilities.7.html
[2] https://www.kernel.org/doc/html/latest/process/embargoed-hardware-issues.html
[3] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/5590d543-82c6-490a-6544-08e6a5517db0@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 40852275 22-Jan-2019 Micah Morton <mortonm@chromium.org>

LSM: add SafeSetID module that gates setid calls

This change ensures that the set*uid family of syscalls in kernel/sys.c
(setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with
the CAP_OPT_INSETID flag, so capability checks in the security_capable
hook can know whether they are being called from within a set*uid
syscall. This change is a no-op by itself, but is needed for the
proposed SafeSetID LSM.

Signed-off-by: Micah Morton <mortonm@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>


# 2fec30e2 23-Jan-2019 Richard Guy Briggs <rgb@redhat.com>

audit: add support for fcaps v3

V3 namespaced file capabilities were introduced in
commit 8db6c34f1dbc ("Introduce v3 namespaced file capabilities")

Add support for these by adding the "frootid" field to the existing
fcaps fields in the NAME and BPRM_FCAPS records.

Please see github issue
https://github.com/linux-audit/audit-kernel/issues/103

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
[PM: comment tweak to fit an 80 char line width]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 8db6c34f 08-May-2017 Serge E. Hallyn <serge@hallyn.com>

Introduce v3 namespaced file capabilities

Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr. If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable. Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege. For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise. The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr. It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change. Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid. This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace. The kernel will
translate that rootid to an absolute uid, and write that to disk. After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file. A task may overwrite an existing xattr so long as it is
privileged over the inode. Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr. This check can be re-added if deemed useful.

This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.

Example using tar:

$ cp /bin/sleep sleepx
$ mkdir b1 b2
$ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
$ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
$ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
$ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
$ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
b2/sleepx = cap_sys_admin+ep
# /opt/ltp/testcases/bin/getv3xattr b2/sleepx
v3 xattr, rootid is 100001

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
Nov 07 2016: convert rootid from and to fs user_ns
(From ebiederm: mar 28 2017)
commoncap.c: fix typos - s/v4/v3
get_vfs_caps_from_disk: clarify the fs_ns root access check
nsfscaps: change the code split for cap_inode_setxattr()
Apr 09 2017:
don't return v3 cap for caps owned by current root.
return a v2 cap for a true v2 cap in non-init ns
Apr 18 2017:
. Change the flow of fscap writing to support s_user_ns writing.
. Remove refuse_fcap_overwrite(). The value of the previous
xattr doesn't matter.
Apr 24 2017:
. incorporate Eric's incremental diff
. move cap_convert_nscap to setxattr and simplify its usage
May 8, 2017:
. fix leaking dentry refcount in cap_inode_getsecurity

Signed-off-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# f84df2a6 16-Nov-2016 Eric W. Biederman <ebiederm@xmission.com>

exec: Ensure mm->user_ns contains the execed files

When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable. If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 64b875f7 14-Nov-2016 Eric W. Biederman <ebiederm@xmission.com>

ptrace: Capture the ptracer's creds not PT_PTRACE_CAP

When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked. This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in. This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid. More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Cc: stable@vger.kernel.org
Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# db3f6001 02-Aug-2016 Alexey Dobriyan <adobriyan@gmail.com>

uapi: move forward declarations of internal structures

Don't user forward declarations of internal kernel structures in headers
exported to userspace.

Move "struct completion;".
Move "struct task_struct;".

Link: http://lkml.kernel.org/r/20160713215808.GA22486@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98f368e9 02-Jun-2016 Tyler Hicks <tyhicks@canonical.com>

kernel: Add noaudit variant of ns_capable()

When checking the current cred for a capability in a specific user
namespace, it isn't always desirable to have the LSMs audit the check.
This patch adds a noaudit variant of ns_capable() for when those
situations arise.

The common logic between ns_capable() and the new ns_capable_noaudit()
is moved into a single, shared function to keep duplicated code to a
minimum and ease maintainability.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>


# 0335695d 22-Mar-2016 Arnd Bergmann <arnd@arndb.de>

cred/userns: define current_user_ns() as a function

The current_user_ns() macro currently returns &init_user_ns when user
namespaces are disabled, and that causes several warnings when building
with gcc-6.0 in code that compares the result of the macro to
&init_user_ns itself:

fs/xfs/xfs_ioctl.c: In function 'xfs_ioctl_setattr_check_projid':
fs/xfs/xfs_ioctl.c:1249:22: error: self-comparison always evaluates to true [-Werror=tautological-compare]
if (current_user_ns() == &init_user_ns)

This is a legitimate warning in principle, but here it isn't really
helpful, so I'm reprasing the definition in a way that shuts up the
warning. Apparently gcc only warns when comparing identical literals,
but it can figure out that the result of an inline function can be
identical to a constant expression in order to optimize a condition yet
not warn about the fact that the condition is known at compile time.
This is exactly what we want here, and it looks reasonable because we
generally prefer inline functions over macros anyway.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e42852bf 17-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

security/capability.h: cap_issubset/isclear can be boolean

This patch makes cap_issubset/isclear return bool due to these
functions only using either one or zero as their return
value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>


# f70be6da 17-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

security: remove unused cap_is_fs_cap function

Since commit 3bc1fa8a ("LSM: remove BSD secure level security module")
there is no user of cap_is_fs_cap any more, so remove it.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>


# 2813893f 15-Apr-2015 Iulia Manda <iulia.manda21@gmail.com>

kernel: conditionally support non-root users, groups and capabilities

There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d8b6c63 23-Jul-2014 Eric Paris <eparis@redhat.com>

CAPABILITIES: remove undefined caps from all processes

This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
plus fixing it a different way...

We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits. This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.

Consider a root application which drops all capabilities from ALL 4
capability sets. We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.

The BSET gets cleared differently. Instead it is cleared one bit at a
time. The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read. So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.

So the 'parent' will look something like:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffc000000000

All of this 'should' be fine. Given that these are undefined bits that
aren't supposed to have anything to do with permissions. But they do...

So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel). We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets. If that root task calls execve()
the child task will pick up all caps not blocked by the bset. The bset
however does not block bits higher than CAP_LAST_CAP. So now the child
task has bits in eff which are not in the parent. These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.

The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits! So now we set durring commit creds that
the child is not dumpable. Given it is 'more priv' than its parent. It
also means the parent cannot ptrace the child and other stupidity.

The solution here:
1) stop hiding capability bits in status
This makes debugging easier!

2) stop giving any task undefined capability bits. it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
This fixes the cap_issubset() tests and resulting fallout (which
made the init task in a docker container untraceable among other
things)

3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
This lets 'setcap all+pe /bin/bash; /bin/bash' run

Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Dan Walsh <dwalsh@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>


# 23adbe12 10-Jun-2014 Andy Lutomirski <luto@amacapital.net>

fs,userns: Change inode_capable to capable_wrt_inode_uidgid

The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c7b96acf 20-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

userns: Kill nsown_capable it makes the wrong thing easy

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 935d8aab 14-Apr-2013 Linus Torvalds <torvalds@linux-foundation.org>

Add file_ns_capable() helper function for open-time capability checking

Nothing is using it yet, but this will allow us to delay the open-time
checks to use time, without breaking the normal UNIX permission
semantics where permissions are determined by the opener (and the file
descriptor can then be passed to a different process, or the process can
drop capabilities).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 607ca46e 13-Oct-2012 David Howells <dhowells@redhat.com>

UAPI: (Scripted) Disintegrate include/linux

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>


# d9914cf6 17-Jul-2012 Michael Kerrisk <mtk.manpages@gmail.com>

PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND

As discussed in
http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990,
the capability introduced in 4d7e30d98939a0340022ccd49325a3d70f7e0238
to govern EPOLLWAKEUP seems misnamed: this capability is about governing
the ability to suspend the system, not using a particular API flag
(EPOLLWAKEUP). We should make the name of the capability more general
to encourage reuse in related cases. (Whether or not this capability
should also be used to govern the use of /sys/power/wake_lock is a
question that needs to be separately resolved.)

This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure
that the old capability name doesn't make it out into the wild, could you
please apply and push up the tree to ensure that it is incorporated
for the 3.5 release.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>


# 4d7e30d9 01-May-2012 Arve Hjønnevåg <arve@android.com>

epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready

When an epoll_event, that has the EPOLLWAKEUP flag set, is ready, a
wakeup_source will be active to prevent suspend. This can be used to
handle wakeup events from a driver that support poll, e.g. input, if
that driver wakes up the waitqueue passed to epoll before allowing
suspend.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>


# 1a48e2ac 14-Nov-2011 Eric W. Biederman <ebiederm@xmission.com>

userns: Replace the hard to write inode_userns with inode_capable.

This represents a change in strategy of how to handle user namespaces.
Instead of tagging everything explicitly with a user namespace and bulking
up all of the comparisons of uids and gids in the kernel, all uids and gids
in use will have a mapping to a flat kuid and kgid spaces respectively. This
allows much more of the existing logic to be preserved and in general
allows for faster code.

In this new and improved world we allow someone to utiliize capabilities
over an inode if the inodes owner mapps into the capabilities holders user
namespace and the user has capabilities in their user namespace. Which
is simple and efficient.

Moving the fs uid comparisons to be comparisons in a flat kuid space
follows in later patches, something that is only significant if you
are using user namespaces.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# f423e5ba 02-Jan-2012 Eric Paris <eparis@redhat.com>

capabilities: remove __cap_full_set definition

In 5163b583a036b103c3cec7171d6731c125773ed6 I removed __cap_full_set but
forgot to remove it from a header. Do that.

Reported-by: Kornilios Kourtis <kkourt@cslab.ece.ntua.gr>
Signed-off-by: Eric Paris <eparis@redhat.com>


# f1c84dae 02-Jan-2012 Eric Paris <eparis@redhat.com>

capabilities: remove task_ns_* functions

task_ in the front of a function, in the security subsystem anyway, means
to me at least, that we are operating with that task as the subject of the
security decision. In this case what it means is that we are using current as
the subject but we use the task to get the right namespace. Who in the world
would ever realize that's what task_ns_capability means just by the name? This
patch eliminates the task_ns functions entirely and uses the has_ns_capability
function instead. This means we explicitly open code the ns in question in
the caller. I think it makes the caller a LOT more clear what is going on.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>


# 7b61d648 02-Jan-2012 Eric Paris <eparis@redhat.com>

capabilites: introduce new has_ns_capabilities_noaudit

For consistency in interfaces, introduce a new interface called
has_ns_capabilities_noaudit. It checks if the given task has the given
capability in the given namespace. Use this new function by
has_capabilities_noaudit.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>


# 6cc7a765 20-Oct-2011 Maciej Żenczykowski <maze@google.com>

net: allow CAP_NET_RAW to set socket options IP{,V6}_TRANSPARENT

Up till now the IP{,V6}_TRANSPARENT socket options (which actually set
the same bit in the socket struct) have required CAP_NET_ADMIN
privileges to set or clear the option.

- we make clearing the bit not require any privileges.
- we allow CAP_NET_ADMIN to set the bit (as before this change)
- we allow CAP_NET_RAW to set this bit, because raw
sockets already pretty much effectively allow you
to emulate socket transparency.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 47a150ed 12-May-2011 Serge E. Hallyn <serge.hallyn@canonical.com>

Cache user_ns in struct cred

If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns).

Get rid of _current_user_ns. This requires nsown_capable() to be
defined in capability.c rather than as static inline in capability.h,
so do that.

Request_key needs init_user_ns defined at current_user_ns if
!CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS
at current_user_ns() define.

Compile-tested with and without CONFIG_USERNS.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
[ This makes a huge performance difference for acl_permission_check(),
up to 30%. And that is one of the hottest kernel functions for loads
that are pathname-lookup heavy. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a7adcf5 11-Jan-2011 John Stultz <john.stultz@linaro.org>

timers: Posix interface for alarm-timers

This patch exposes alarm-timers to userland via the posix clock
and timers interface, using two new clockids: CLOCK_REALTIME_ALARM
and CLOCK_BOOTTIME_ALARM. Both clockids behave identically to
CLOCK_REALTIME and CLOCK_BOOTTIME, respectively, but timers
set against the _ALARM suffixed clockids will wake the system if
it is suspended.

Some background can be found here:
https://lwn.net/Articles/429925/

The concept for Alarm-timers was inspired by the Android Alarm
driver (by Arve Hjønnevåg) found in the Android kernel tree.

See: http://android.git.kernel.org/?p=kernel/common.git;a=blob;f=drivers/rtc/alarm.c;h=1250edfbdf3302f5e4ea6194847c6ef4bb7beb1c;hb=android-2.6.36

While the in-kernel interface is pretty similar between
alarm-timers and Android alarm driver, the user-space interface
for the Android alarm driver is via ioctls to a new char device.
As mentioned above, I've instead chosen to export this functionality
via the posix interface, as it seemed a little simpler and avoids
creating duplicate interfaces to things like CLOCK_REALTIME and
CLOCK_MONOTONIC under alternate names (ie:ANDROID_ALARM_RTC and
ANDROID_ALARM_SYSTEMTIME).

The semantics of the Android alarm driver are different from what
this posix interface provides. For instance, threads other then
the thread waiting on the Android alarm driver are able to modify
the alarm being waited on. Also this interface does not allow
the same wakelock semantics that the Android driver provides
(ie: kernel takes a wakelock on RTC alarm-interupt, and holds it
through process wakeup, and while the process runs, until the
process either closes the char device or calls back in to wait
on a new alarm).

One potential way to implement similar semantics may be via
the timerfd infrastructure, but this needs more research.

There may also need to be some sort of sysfs system level policy
hooks that allow alarm timers to be disabled to keep them
from firing at inappropriate times (ie: laptop in a well insulated
bag, mid-flight).

CC: Arve Hjønnevåg <arve@android.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# a3232d2f 01-Apr-2011 Eric Paris <eparis@redhat.com>

capabilities: delete all CAP_INIT macros

The CAP_INIT macros of INH, BSET, and EFF made sense at one point in time,
but now days they aren't helping. Just open code the logic in the
init_cred.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 5163b583 01-Apr-2011 Eric Paris <eparis@redhat.com>

capabilities: delete unused cap_set_full

unused code. Clean it up.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>


# ffa8e59d 01-Apr-2011 Eric Paris <eparis@redhat.com>

capabilities: do not drop CAP_SETPCAP from the initial task

In olden' days of yore CAP_SETPCAP had special meaning for the init task.
We actually have code to make sure that CAP_SETPCAP wasn't in pE of things
using the init_cred. But CAP_SETPCAP isn't so special any more and we
don't have a reason to special case dropping it for init or kthreads....

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>


# 3263245d 23-Mar-2011 Serge E. Hallyn <serge.hallyn@canonical.com>

userns: make has_capability* into real functions

So we can let type safety keep things sane, and as a bonus we can remove
the declaration of init_user_ns in capability.h.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8409cca7 23-Mar-2011 Serge E. Hallyn <serge@hallyn.com>

userns: allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to the
usual rules (i.e. the same rules as for two tasks in the init user
namespace). ptrace is also allowed to a user namespace to which the
current task the has CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
Jan 1: Use const cred struct
Jan 11: use task_ns_capable() in place of ptrace_capable().
Feb 23: same_or_ancestore_user_ns() was not an appropriate
check to constrain cap_issubset. Rather, cap_issubset()
only is meaningful when both capsets are in the same
user_ns.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3486740a 23-Mar-2011 Serge E. Hallyn <serge@hallyn.com>

userns: security: make capabilities relative to the user namespace

- Introduce ns_capable to test for a capability in a non-default
user namespace.
- Teach cap_capable to handle capabilities in a non-default
user namespace.

The motivation is to get to the unprivileged creation of new
namespaces. It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created. THis is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator. Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case
01/11/2011: [serge] add task_ns_capable helper
01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
02/16/2011: [serge] fix a logic bug: the root user is always creator of
init_user_ns, but should not always have capabilities to
it! Fix the check in cap_capable().
02/21/2011: Add the required user_ns parameter to security_capable,
fixing a compile failure.
02/23/2011: Convert some macros to functions as per akpm comments. Some
couldn't be converted because we can't easily forward-declare
them (they are inline if !SECURITY, extern if SECURITY). Add
a current_user_ns function so we can use it in capability.h
without #including cred.h. Move all forward declarations
together to the top of the #ifdef __KERNEL__ section, and use
kernel-doc format.
02/23/2011: Per dhowells, clean up comment in cap_capable().
02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable.

(Original written and signed off by Eric; latest, modified version
acked by him)

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: export current_user_ns() for ecryptfs]
[serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce6ada35 25-Nov-2010 Serge E. Hallyn <serge@hallyn.com>

security: Define CAP_SYSLOG

Privileged syslog operations currently require CAP_SYS_ADMIN. Split
this off into a new CAP_SYSLOG privilege which we can sanely take away
from a container through the capability bounding set.

With this patch, an lxc container can be prevented from messing with
the host's syslog (i.e. dmesg -c).

Changelog: mar 12 2010: add selinux capability2:cap_syslog perm
Changelog: nov 22 2010:
. port to new kernel
. add a WARN_ONCE if userspace isn't using CAP_SYSLOG

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-By: Kees Cook <kees.cook@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Christopher J. PeBenito" <cpebenito@tresys.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: James Morris <jmorris@namei.org>


# af4f1360 01-Jul-2010 Mimi Zohar <zohar@linux.vnet.ibm.com>

security: move LSM xattrnames to xattr.h

Make the security extended attributes names global. Updated to move
the remaining Smack xattrs.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b3a222e5 23-Nov-2009 Serge E. Hallyn <serue@us.ibm.com>

remove CONFIG_SECURITY_FILE_CAPABILITIES compile option

As far as I know, all distros currently ship kernels with default
CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on
leaves a 'no_file_caps' option to boot without file capabilities,
the main reason to keep the option is that turning it off saves
you (on my s390x partition) 5k. In particular, vmlinux sizes
came to:

without patch fscaps=n: 53598392
without patch fscaps=y: 53603406
with this patch applied: 53603342

with the security-next tree.

Against this we must weigh the fact that there is no simple way for
userspace to figure out whether file capabilities are supported,
while things like per-process securebits, capability bounding
sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported
with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for
applications wanting to know whether they can use them and/or why
something failed.

It also adds another subtly different set of semantics which we must
maintain at the risk of severe security regressions.

So this patch removes the SECURITY_FILE_CAPABILITIES compile
option. It drops the kernel size by about 50k over the stock
SECURITY_FILE_CAPABILITIES=y kernel, by removing the
cap_limit_ptraced_target() function.

Changelog:
Nov 20: remove cap_limit_ptraced_target() as it's logic
was ifndef'ed.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Andrew G. Morgan" <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>


# bcf56442 16-Jun-2009 GeunSik Lim <leemgs1@gmail.com>

trivial: change address of the libcap source.

This is patch to change ftp site of the libcap source.
"ftp://linux.kernel.org" address does not exist.

Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 0ad30b8f 13-Apr-2009 Serge E. Hallyn <serue@us.ibm.com>

add some long-missing capabilities to fs_mask

When POSIX capabilities were introduced during the 2.1 Linux
cycle, the fs mask, which represents the capabilities which having
fsuid==0 is supposed to grant, did not include CAP_MKNOD and
CAP_LINUX_IMMUTABLE. However, before capabilities the privilege
to call these did in fact depend upon fsuid==0.

This patch introduces those capabilities into the fsmask,
restoring the old behavior.

See the thread starting at http://lkml.org/lkml/2009/3/11/157 for
reference.

Note that if this fix is deemed valid, then earlier kernel versions (2.4
and 2.2) ought to be fixed too.

Changelog:
[Mar 23] Actually delete old CAP_FS_SET definition...
[Mar 20] Updated against J. Bruce Fields's patch

Reported-by: Igor Zhbanov <izh1979@gmail.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: stable@kernel.org
Cc: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 76a67ec6 16-Mar-2009 J. Bruce Fields <bfields@citi.umich.edu>

nfsd: nfsd should drop CAP_MKNOD for non-root

Since creating a device node is normally an operation requiring special
privilege, Igor Zhbanov points out that it is surprising (to say the
least) that a client can, for example, create a device node on a
filesystem exported with root_squash.

So, make sure CAP_MKNOD is among the capabilities dropped when an nfsd
thread handles a request from a non-root user.

Reported-by: Igor Zhbanov <izh1979@gmail.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>


# 9fa91d99 30-Jan-2009 Jaswinder Singh Rajput <jaswinderrajput@gmail.com>

headers_check fix: linux/capability.h

fix the following 'make headers_check' warning:

usr/include/linux/capability.h:73: extern's make no sense in userspace

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>


# 3699c53c 06-Jan-2009 David Howells <dhowells@redhat.com>

CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #3]

Fix a regression in cap_capable() due to:

commit 3b11a1decef07c19443d24ae926982bc8ec9f4c0
Author: David Howells <dhowells@redhat.com>
Date: Fri Nov 14 10:39:26 2008 +1100

CRED: Differentiate objective and effective subjective credentials on a task

The problem is that the above patch allows a process to have two sets of
credentials, and for the most part uses the subjective credentials when
accessing current's creds.

There is, however, one exception: cap_capable(), and thus capable(), uses the
real/objective credentials of the target task, whether or not it is the current
task.

Ordinarily this doesn't matter, since usually the two cred pointers in current
point to the same set of creds. However, sys_faccessat() makes use of this
facility to override the credentials of the calling process to make its test,
without affecting the creds as seen from other processes.

One of the things sys_faccessat() does is to make an adjustment to the
effective capabilities mask, which cap_capable(), as it stands, then ignores.

The affected capability check is in generic_permission():

if (!(mask & MAY_EXEC) || execute_ok(inode))
if (capable(CAP_DAC_OVERRIDE))
return 0;

This change passes the set of credentials to be tested down into the commoncap
and SELinux code. The security functions called by capable() and
has_capability() select the appropriate set of credentials from the process
being checked.

This can be tested by compiling the following program from the XFS testsuite:

/*
* t_access_root.c - trivial test program to show permission bug.
*
* Written by Michael Kerrisk - copyright ownership not pursued.
* Sourced from: http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-10/6030.html
*/
#include <limits.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>

#define UID 500
#define GID 100
#define PERM 0
#define TESTPATH "/tmp/t_access"

static void
errExit(char *msg)
{
perror(msg);
exit(EXIT_FAILURE);
} /* errExit */

static void
accessTest(char *file, int mask, char *mstr)
{
printf("access(%s, %s) returns %d\n", file, mstr, access(file, mask));
} /* accessTest */

int
main(int argc, char *argv[])
{
int fd, perm, uid, gid;
char *testpath;
char cmd[PATH_MAX + 20];

testpath = (argc > 1) ? argv[1] : TESTPATH;
perm = (argc > 2) ? strtoul(argv[2], NULL, 8) : PERM;
uid = (argc > 3) ? atoi(argv[3]) : UID;
gid = (argc > 4) ? atoi(argv[4]) : GID;

unlink(testpath);

fd = open(testpath, O_RDWR | O_CREAT, 0);
if (fd == -1) errExit("open");

if (fchown(fd, uid, gid) == -1) errExit("fchown");
if (fchmod(fd, perm) == -1) errExit("fchmod");
close(fd);

snprintf(cmd, sizeof(cmd), "ls -l %s", testpath);
system(cmd);

if (seteuid(uid) == -1) errExit("seteuid");

accessTest(testpath, 0, "0");
accessTest(testpath, R_OK, "R_OK");
accessTest(testpath, W_OK, "W_OK");
accessTest(testpath, X_OK, "X_OK");
accessTest(testpath, R_OK | W_OK, "R_OK | W_OK");
accessTest(testpath, R_OK | X_OK, "R_OK | X_OK");
accessTest(testpath, W_OK | X_OK, "W_OK | X_OK");
accessTest(testpath, R_OK | W_OK | X_OK, "R_OK | W_OK | X_OK");

exit(EXIT_SUCCESS);
} /* main */

This can be run against an Ext3 filesystem as well as against an XFS
filesystem. If successful, it will show:

[root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043
---------- 1 dhowells dhowells 0 2008-12-31 03:00 /tmp/xxx
access(/tmp/xxx, 0) returns 0
access(/tmp/xxx, R_OK) returns 0
access(/tmp/xxx, W_OK) returns 0
access(/tmp/xxx, X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK) returns 0
access(/tmp/xxx, R_OK | X_OK) returns -1
access(/tmp/xxx, W_OK | X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1

If unsuccessful, it will show:

[root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043
---------- 1 dhowells dhowells 0 2008-12-31 02:56 /tmp/xxx
access(/tmp/xxx, 0) returns 0
access(/tmp/xxx, R_OK) returns -1
access(/tmp/xxx, W_OK) returns -1
access(/tmp/xxx, X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK) returns -1
access(/tmp/xxx, R_OK | X_OK) returns -1
access(/tmp/xxx, W_OK | X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1

I've also tested the fix with the SELinux and syscalls LTP testsuites.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 29881c45 06-Jan-2009 James Morris <jmorris@namei.org>

Revert "CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2]"

This reverts commit 14eaddc967b16017d4a1a24d2be6c28ecbe06ed8.

David has a better version to come.


# 14eaddc9 31-Dec-2008 David Howells <dhowells@redhat.com>

CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2]

Fix a regression in cap_capable() due to:

commit 5ff7711e635b32f0a1e558227d030c7e45b4a465
Author: David Howells <dhowells@redhat.com>
Date: Wed Dec 31 02:52:28 2008 +0000

CRED: Differentiate objective and effective subjective credentials on a task

The problem is that the above patch allows a process to have two sets of
credentials, and for the most part uses the subjective credentials when
accessing current's creds.

There is, however, one exception: cap_capable(), and thus capable(), uses the
real/objective credentials of the target task, whether or not it is the current
task.

Ordinarily this doesn't matter, since usually the two cred pointers in current
point to the same set of creds. However, sys_faccessat() makes use of this
facility to override the credentials of the calling process to make its test,
without affecting the creds as seen from other processes.

One of the things sys_faccessat() does is to make an adjustment to the
effective capabilities mask, which cap_capable(), as it stands, then ignores.

The affected capability check is in generic_permission():

if (!(mask & MAY_EXEC) || execute_ok(inode))
if (capable(CAP_DAC_OVERRIDE))
return 0;

This change splits capable() from has_capability() down into the commoncap and
SELinux code. The capable() security op now only deals with the current
process, and uses the current process's subjective creds. A new security op -
task_capable() - is introduced that can check any task's objective creds.

strictly the capable() security op is superfluous with the presence of the
task_capable() op, however it should be faster to call the capable() op since
two fewer arguments need be passed down through the various layers.

This can be tested by compiling the following program from the XFS testsuite:

/*
* t_access_root.c - trivial test program to show permission bug.
*
* Written by Michael Kerrisk - copyright ownership not pursued.
* Sourced from: http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-10/6030.html
*/
#include <limits.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>

#define UID 500
#define GID 100
#define PERM 0
#define TESTPATH "/tmp/t_access"

static void
errExit(char *msg)
{
perror(msg);
exit(EXIT_FAILURE);
} /* errExit */

static void
accessTest(char *file, int mask, char *mstr)
{
printf("access(%s, %s) returns %d\n", file, mstr, access(file, mask));
} /* accessTest */

int
main(int argc, char *argv[])
{
int fd, perm, uid, gid;
char *testpath;
char cmd[PATH_MAX + 20];

testpath = (argc > 1) ? argv[1] : TESTPATH;
perm = (argc > 2) ? strtoul(argv[2], NULL, 8) : PERM;
uid = (argc > 3) ? atoi(argv[3]) : UID;
gid = (argc > 4) ? atoi(argv[4]) : GID;

unlink(testpath);

fd = open(testpath, O_RDWR | O_CREAT, 0);
if (fd == -1) errExit("open");

if (fchown(fd, uid, gid) == -1) errExit("fchown");
if (fchmod(fd, perm) == -1) errExit("fchmod");
close(fd);

snprintf(cmd, sizeof(cmd), "ls -l %s", testpath);
system(cmd);

if (seteuid(uid) == -1) errExit("seteuid");

accessTest(testpath, 0, "0");
accessTest(testpath, R_OK, "R_OK");
accessTest(testpath, W_OK, "W_OK");
accessTest(testpath, X_OK, "X_OK");
accessTest(testpath, R_OK | W_OK, "R_OK | W_OK");
accessTest(testpath, R_OK | X_OK, "R_OK | X_OK");
accessTest(testpath, W_OK | X_OK, "W_OK | X_OK");
accessTest(testpath, R_OK | W_OK | X_OK, "R_OK | W_OK | X_OK");

exit(EXIT_SUCCESS);
} /* main */

This can be run against an Ext3 filesystem as well as against an XFS
filesystem. If successful, it will show:

[root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043
---------- 1 dhowells dhowells 0 2008-12-31 03:00 /tmp/xxx
access(/tmp/xxx, 0) returns 0
access(/tmp/xxx, R_OK) returns 0
access(/tmp/xxx, W_OK) returns 0
access(/tmp/xxx, X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK) returns 0
access(/tmp/xxx, R_OK | X_OK) returns -1
access(/tmp/xxx, W_OK | X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1

If unsuccessful, it will show:

[root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043
---------- 1 dhowells dhowells 0 2008-12-31 02:56 /tmp/xxx
access(/tmp/xxx, 0) returns 0
access(/tmp/xxx, R_OK) returns -1
access(/tmp/xxx, W_OK) returns -1
access(/tmp/xxx, X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK) returns -1
access(/tmp/xxx, R_OK | X_OK) returns -1
access(/tmp/xxx, W_OK | X_OK) returns -1
access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1

I've also tested the fix with the SELinux and syscalls LTP testsuites.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>


# d84f4f99 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Inaugurate COW credentials

Inaugurate copy-on-write credentials management. This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.

A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().

With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:

struct cred *new = prepare_creds();
int ret = blah(new);
if (ret < 0) {
abort_creds(new);
return ret;
}
return commit_creds(new);

There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.

To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const. The purpose of this is compile-time
discouragement of altering credentials through those pointers. Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:

(1) Its reference count may incremented and decremented.

(2) The keyrings to which it points may be modified, but not replaced.

The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).

This patch and the preceding patches have been tested with the LTP SELinux
testsuite.

This patch makes several logical sets of alteration:

(1) execve().

This now prepares and commits credentials in various places in the
security code rather than altering the current creds directly.

(2) Temporary credential overrides.

do_coredump() and sys_faccessat() now prepare their own credentials and
temporarily override the ones currently on the acting thread, whilst
preventing interference from other threads by holding cred_replace_mutex
on the thread being dumped.

This will be replaced in a future patch by something that hands down the
credentials directly to the functions being called, rather than altering
the task's objective credentials.

(3) LSM interface.

A number of functions have been changed, added or removed:

(*) security_capset_check(), ->capset_check()
(*) security_capset_set(), ->capset_set()

Removed in favour of security_capset().

(*) security_capset(), ->capset()

New. This is passed a pointer to the new creds, a pointer to the old
creds and the proposed capability sets. It should fill in the new
creds or return an error. All pointers, barring the pointer to the
new creds, are now const.

(*) security_bprm_apply_creds(), ->bprm_apply_creds()

Changed; now returns a value, which will cause the process to be
killed if it's an error.

(*) security_task_alloc(), ->task_alloc_security()

Removed in favour of security_prepare_creds().

(*) security_cred_free(), ->cred_free()

New. Free security data attached to cred->security.

(*) security_prepare_creds(), ->cred_prepare()

New. Duplicate any security data attached to cred->security.

(*) security_commit_creds(), ->cred_commit()

New. Apply any security effects for the upcoming installation of new
security by commit_creds().

(*) security_task_post_setuid(), ->task_post_setuid()

Removed in favour of security_task_fix_setuid().

(*) security_task_fix_setuid(), ->task_fix_setuid()

Fix up the proposed new credentials for setuid(). This is used by
cap_set_fix_setuid() to implicitly adjust capabilities in line with
setuid() changes. Changes are made to the new credentials, rather
than the task itself as in security_task_post_setuid().

(*) security_task_reparent_to_init(), ->task_reparent_to_init()

Removed. Instead the task being reparented to init is referred
directly to init's credentials.

NOTE! This results in the loss of some state: SELinux's osid no
longer records the sid of the thread that forked it.

(*) security_key_alloc(), ->key_alloc()
(*) security_key_permission(), ->key_permission()

Changed. These now take cred pointers rather than task pointers to
refer to the security context.

(4) sys_capset().

This has been simplified and uses less locking. The LSM functions it
calls have been merged.

(5) reparent_to_kthreadd().

This gives the current thread the same credentials as init by simply using
commit_thread() to point that way.

(6) __sigqueue_alloc() and switch_uid()

__sigqueue_alloc() can't stop the target task from changing its creds
beneath it, so this function gets a reference to the currently applicable
user_struct which it then passes into the sigqueue struct it returns if
successful.

switch_uid() is now called from commit_creds(), and possibly should be
folded into that. commit_creds() should take care of protecting
__sigqueue_alloc().

(7) [sg]et[ug]id() and co and [sg]et_current_groups.

The set functions now all use prepare_creds(), commit_creds() and
abort_creds() to build and check a new set of credentials before applying
it.

security_task_set[ug]id() is called inside the prepared section. This
guarantees that nothing else will affect the creds until we've finished.

The calling of set_dumpable() has been moved into commit_creds().

Much of the functionality of set_user() has been moved into
commit_creds().

The get functions all simply access the data directly.

(8) security_task_prctl() and cap_task_prctl().

security_task_prctl() has been modified to return -ENOSYS if it doesn't
want to handle a function, or otherwise return the return value directly
rather than through an argument.

Additionally, cap_task_prctl() now prepares a new set of credentials, even
if it doesn't end up using it.

(9) Keyrings.

A number of changes have been made to the keyrings code:

(a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
all been dropped and built in to the credentials functions directly.
They may want separating out again later.

(b) key_alloc() and search_process_keyrings() now take a cred pointer
rather than a task pointer to specify the security context.

(c) copy_creds() gives a new thread within the same thread group a new
thread keyring if its parent had one, otherwise it discards the thread
keyring.

(d) The authorisation key now points directly to the credentials to extend
the search into rather pointing to the task that carries them.

(e) Installing thread, process or session keyrings causes a new set of
credentials to be created, even though it's not strictly necessary for
process or session keyrings (they're shared).

(10) Usermode helper.

The usermode helper code now carries a cred struct pointer in its
subprocess_info struct instead of a new session keyring pointer. This set
of credentials is derived from init_cred and installed on the new process
after it has been cloned.

call_usermodehelper_setup() allocates the new credentials and
call_usermodehelper_freeinfo() discards them if they haven't been used. A
special cred function (prepare_usermodeinfo_creds()) is provided
specifically for call_usermodehelper_setup() to call.

call_usermodehelper_setkeys() adjusts the credentials to sport the
supplied keyring as the new session keyring.

(11) SELinux.

SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:

(a) selinux_setprocattr() no longer does its check for whether the
current ptracer can access processes with the new SID inside the lock
that covers getting the ptracer's SID. Whilst this lock ensures that
the check is done with the ptracer pinned, the result is only valid
until the lock is released, so there's no point doing it inside the
lock.

(12) is_single_threaded().

This function has been extracted from selinux_setprocattr() and put into
a file of its own in the lib/ directory as join_session_keyring() now
wants to use it too.

The code in SELinux just checked to see whether a task shared mm_structs
with other tasks (CLONE_VM), but that isn't good enough. We really want
to know if they're part of the same thread group (CLONE_THREAD).

(13) nfsd.

The NFS server daemon now has to use the COW credentials to set the
credentials it is going to use. It really needs to pass the credentials
down to the functions it calls, but it can't do that until other patches
in this series have been applied.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>


# 92a77aac 12-Nov-2008 James Morris <jmorris@namei.org>

security: remove broken and useless declarations

Remove broken declarations for security_capable* functions,
which were not needed anyway.

Signed-off-by: James Morris <jmorris@namei.org>


# 06112163 11-Nov-2008 Eric Paris <eparis@redhat.com>

Add a new capable interface that will be used by systems that use audit to
make an A or B type decision instead of a security decision. Currently
this is the case at least for filesystems when deciding if a process can use
the reserved 'root' blocks and for the case of things like the oom
algorithm determining if processes are root processes and should be less
likely to be killed. These types of security system requests should not be
audited or logged since they are not really security decisions. It would be
possible to solve this problem like the vm_enough_memory security check did
by creating a new LSM interface and moving all of the policy into that
interface but proves the needlessly bloat the LSM and provide complex
indirection.

This merely allows those decisions to be made where they belong and to not
flood logs or printk with denials for thing that are not security decisions.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>


# 851f7ff5 11-Nov-2008 Eric Paris <eparis@redhat.com>

This patch will print cap_permitted and cap_inheritable data in the PATH
records of any file that has file capabilities set. Files which do not
have fcaps set will not have different PATH records.

An example audit record if you run:
setcap "cap_net_admin+pie" /bin/bash
/bin/bash

type=SYSCALL msg=audit(1225741937.363:230): arch=c000003e syscall=59 success=yes exit=0 a0=2119230 a1=210da30 a2=20ee290 a3=8 items=2 ppid=2149 pid=2923 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="ping" exe="/bin/ping" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=EXECVE msg=audit(1225741937.363:230): argc=2 a0="ping" a1="www.google.com"
type=CWD msg=audit(1225741937.363:230): cwd="/root"
type=PATH msg=audit(1225741937.363:230): item=0 name="/bin/ping" inode=49256 dev=fd:00 mode=0104755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ping_exec_t:s0 cap_fp=0000000000002000 cap_fi=0000000000002000 cap_fe=1 cap_fver=2
type=PATH msg=audit(1225741937.363:230): item=1 name=(null) inode=507915 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# c0b00441 11-Nov-2008 Eric Paris <eparis@redhat.com>

This patch add a generic cpu endian caps structure and externally available
functions which retrieve fcaps information from disk. This information is
necessary so fcaps information can be collected and recorded by the audit
system.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 9d36be76 11-Nov-2008 Eric Paris <eparis@redhat.com>

Document the order of arguments for cap_issubset. It's not instantly clear
which order the argument should be in. So give an example.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 1f29fae2 05-Nov-2008 Serge E. Hallyn <serue@us.ibm.com>

file capabilities: add no_file_caps switch (v4)

Add a no_file_caps boot option when file capabilities are
compiled into the kernel (CONFIG_SECURITY_FILE_CAPABILITIES=y).

This allows distributions to ship a kernel with file capabilities
compiled in, without forcing users to use (and understand and
trust) them.

When no_file_caps is specified at boot, then when a process executes
a file, any file capabilities stored with that file will not be
used in the calculation of the process' new capability sets.

This means that booting with the no_file_caps boot option will
not be the same as booting a kernel with file capabilities
compiled out - in particular a task with CAP_SETPCAP will not
have any chance of passing capabilities to another task (which
isn't "really" possible anyway, and which may soon by killed
altogether by David Howells in any case), and it will instead
be able to put new capabilities in its pI. However since fI
will always be empty and pI is masked with fI, it gains the
task nothing.

We also support the extra prctl options, setting securebits and
dropping capabilities from the per-process bounding set.

The other remaining difference is that killpriv, task_setscheduler,
setioprio, and setnice will continue to be hooked. That will
be noticable in the case where a root task changed its uid
while keeping some caps, and another task owned by the new uid
tries to change settings for the more privileged task.

Changelog:
Nov 05 2008: (v4) trivial port on top of always-start-\
with-clear-caps patch
Sep 23 2008: nixed file_caps_enabled when file caps are
not compiled in as it isn't used.
Document no_file_caps in kernel-parameters.txt.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>


# 5cd9c58f 14-Aug-2008 David Howells <dhowells@redhat.com>

security: Fix setting of PF_SUPERPRIV by __capable()

Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
the target process if that is not the current process and it is trying to
change its own flags in a different way at the same time.

__capable() is using neither atomic ops nor locking to protect t->flags. This
patch removes __capable() and introduces has_capability() that doesn't set
PF_SUPERPRIV on the process being queried.

This patch further splits security_ptrace() in two:

(1) security_ptrace_may_access(). This passes judgement on whether one
process may access another only (PTRACE_MODE_ATTACH for ptrace() and
PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
current is the parent.

(2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
and takes only a pointer to the parent process. current is the child.

In Smack and commoncap, this uses has_capability() to determine whether
the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
This does not set PF_SUPERPRIV.

Two of the instances of __capable() actually only act on current, and so have
been changed to calls to capable().

Of the places that were using __capable():

(1) The OOM killer calls __capable() thrice when weighing the killability of a
process. All of these now use has_capability().

(2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
whether the parent was allowed to trace any process. As mentioned above,
these have been split. For PTRACE_ATTACH and /proc, capable() is now
used, and for PTRACE_TRACEME, has_capability() is used.

(3) cap_safe_nice() only ever saw current, so now uses capable().

(4) smack_setprocattr() rejected accesses to tasks other than current just
after calling __capable(), so the order of these two tests have been
switched and capable() is used instead.

(5) In smack_file_send_sigiotask(), we need to allow privileged processes to
receive SIGIO on files they're manipulating.

(6) In smack_task_wait(), we let a process wait for a privileged process,
whether or not the process doing the waiting is privileged.

I've tested this with the LTP SELinux and syscalls testscripts.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>


# 086f7316 04-Jul-2008 Andrew G. Morgan <morgan@kernel.org>

security: filesystem capabilities: fix fragile setuid fixup code

This commit includes a bugfix for the fragile setuid fixup code in the
case that filesystem capabilities are supported (in access()). The effect
of this fix is gated on filesystem capability support because changing
securebits is only supported when filesystem capabilities support is
configured.)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ca05a99a 27-May-2008 Andrew G. Morgan <morgan@kernel.org>

capabilities: remain source compatible with 32-bit raw legacy capability support.

Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset(). Its unfortunate, but
true.

Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities. These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().

As such, this patch does a number of things to clean up the situation
for all. It

1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
legacy value.

2. adopts a new #define strategy for the kernel's internal
implementation of the preferred magic.

3. deprecates v2 capability magic in favor of a new (v3) magic
number. The functionality of v3 is entirely equivalent to v2,
the only difference being that the v2 magic causes the kernel
to log a "deprecated" warning so the admin can find applications
that may be using v2 inappropriately.

[User space code continues to be encouraged to use the libcap API which
protects the application from details like this. libcap-2.10 is the first
to support v3 capabilities.]

Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.

[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>


# 780db6c1 29-Apr-2008 David Howells <dhowells@redhat.com>

Security: Typecast CAP_*_SET macros

Cast the CAP_*_SET macros to be of kernel_cap_t type to avoid compiler
warnings.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 25f2ea9f 29-Apr-2008 David Howells <dhowells@redhat.com>

Security: Typecast CAP_*_SET macros

Cast the CAP_*_SET macros to be of kernel_cap_t type to avoid compiler
warnings.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3898b1b4 28-Apr-2008 Andrew G. Morgan <morgan@kernel.org>

capabilities: implement per-process securebits

Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead. That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.

Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege. Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.

Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege. In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.

The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0. This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()). This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.

With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:

prctl(PR_SET_SECUREBITS, 0x2f);

This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.

[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e114e473 04-Feb-2008 Casey Schaufler <casey@schaufler-ca.com>

Smack: Simplified Mandatory Access Control Kernel

Smack is the Simplified Mandatory Access Control Kernel.

Smack implements mandatory access control (MAC) using labels
attached to tasks and data containers, including files, SVIPC,
and other tasks. Smack is a kernel based scheme that requires
an absolute minimum of application support and a very small
amount of configuration data.

Smack uses extended attributes and
provides a set of general mount options, borrowing technics used
elsewhere. Smack uses netlabel for CIPSO labeling. Smack provides
a pseudo-filesystem smackfs that is used for manipulation of
system Smack attributes.

The patch, patches for ls and sshd, a README, a startup script,
and x86 binaries for ls and sshd are also available on

http://www.schaufler-ca.com

Development has been done using Fedora Core 7 in a virtual machine
environment and on an old Sony laptop.

Smack provides mandatory access controls based on the label attached
to a task and the label attached to the object it is attempting to
access. Smack labels are deliberately short (1-23 characters) text
strings. Single character labels using special characters are reserved
for system use. The only operation applied to Smack labels is equality
comparison. No wildcards or expressions, regular or otherwise, are
used. Smack labels are composed of printable characters and may not
include "/".

A file always gets the Smack label of the task that created it.

Smack defines and uses these labels:

"*" - pronounced "star"
"_" - pronounced "floor"
"^" - pronounced "hat"
"?" - pronounced "huh"

The access rules enforced by Smack are, in order:

1. Any access requested by a task labeled "*" is denied.
2. A read or execute access requested by a task labeled "^"
is permitted.
3. A read or execute access requested on an object labeled "_"
is permitted.
4. Any access requested on an object labeled "*" is permitted.
5. Any access requested by a task on an object with the same
label is permitted.
6. Any access requested that is explicitly defined in the loaded
rule set is permitted.
7. Any other access is denied.

Rules may be explicitly defined by writing subject,object,access
triples to /smack/load.

Smack rule sets can be easily defined that describe Bell&LaPadula
sensitivity, Biba integrity, and a variety of interesting
configurations. Smack rule sets can be modified on the fly to
accommodate changes in the operating environment or even the time
of day.

Some practical use cases:

Hierarchical levels. The less common of the two usual uses
for MLS systems is to define hierarchical levels, often
unclassified, confidential, secret, and so on. To set up smack
to support this, these rules could be defined:

C Unclass rx
S C rx
S Unclass rx
TS S rx
TS C rx
TS Unclass rx

A TS process can read S, C, and Unclass data, but cannot write it.
An S process can read C and Unclass. Note that specifying that
TS can read S and S can read C does not imply TS can read C, it
has to be explicitly stated.

Non-hierarchical categories. This is the more common of the
usual uses for an MLS system. Since the default rule is that a
subject cannot access an object with a different label no
access rules are required to implement compartmentalization.

A case that the Bell & LaPadula policy does not allow is demonstrated
with this Smack access rule:

A case that Bell&LaPadula does not allow that Smack does:

ESPN ABC r
ABC ESPN r

On my portable video device I have two applications, one that
shows ABC programming and the other ESPN programming. ESPN wants
to show me sport stories that show up as news, and ABC will
only provide minimal information about a sports story if ESPN
is covering it. Each side can look at the other's info, neither
can change the other. Neither can see what FOX is up to, which
is just as well all things considered.

Another case that I especially like:

SatData Guard w
Guard Publish w

A program running with the Guard label opens a UDP socket and
accepts messages sent by a program running with a SatData label.
The Guard program inspects the message to ensure it is wholesome
and if it is sends it to a program running with the Publish label.
This program then puts the information passed in an appropriate
place. Note that the Guard program cannot write to a Publish
file system object because file system semanitic require read as
well as write.

The four cases (categories, levels, mutual read, guardbox) here
are all quite real, and problems I've been asked to solve over
the years. The first two are easy to do with traditonal MLS systems
while the last two you can't without invoking privilege, at least
for a while.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Cc: Joshua Brindle <method@manicmethod.com>
Cc: Paul Moore <paul.moore@hp.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b7391de 04-Feb-2008 Serge E. Hallyn <serue@us.ibm.com>

capabilities: introduce per-process capability bounding set

The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system. It can be manipulated through sysctl,
but only init can add capabilities. Root can remove capabilities. By
default it includes all caps except CAP_SETPCAP.

This patch makes the bounding set per-process when file capabilities are
enabled. It is inherited at fork from parent. Noone can add elements,
CAP_SETPCAP is required to remove them.

One example use of this is to start a safer container. For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.

The bounding set will not affect pP and pE immediately. It will only
affect pP' and pE' after subsequent exec()s. It also does not affect pI,
and exec() does not constrain pI'. So to really start a shell with no way
of regain CAP_MKNOD, you would do

prctl(PR_CAPBSET_DROP, CAP_MKNOD);
cap_t cap = cap_get_proc();
cap_value_t caparray[1];
caparray[0] = CAP_MKNOD;
cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
cap_set_proc(cap);
cap_free(cap);

The following test program will get and set the bounding
set (but not pI). For instance

./bset get
(lists capabilities in bset)
./bset drop cap_net_raw
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)

************************************************************
cap_bound.c
************************************************************
#include <sys/prctl.h>
#include <linux/capability.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef PR_CAPBSET_READ
#define PR_CAPBSET_READ 23
#endif

#ifndef PR_CAPBSET_DROP
#define PR_CAPBSET_DROP 24
#endif

int usage(char *me)
{
printf("Usage: %s get\n", me);
printf(" %s drop <capability>\n", me);
return 1;
}

#define numcaps 32
char *captable[numcaps] = {
"cap_chown",
"cap_dac_override",
"cap_dac_read_search",
"cap_fowner",
"cap_fsetid",
"cap_kill",
"cap_setgid",
"cap_setuid",
"cap_setpcap",
"cap_linux_immutable",
"cap_net_bind_service",
"cap_net_broadcast",
"cap_net_admin",
"cap_net_raw",
"cap_ipc_lock",
"cap_ipc_owner",
"cap_sys_module",
"cap_sys_rawio",
"cap_sys_chroot",
"cap_sys_ptrace",
"cap_sys_pacct",
"cap_sys_admin",
"cap_sys_boot",
"cap_sys_nice",
"cap_sys_resource",
"cap_sys_time",
"cap_sys_tty_config",
"cap_mknod",
"cap_lease",
"cap_audit_write",
"cap_audit_control",
"cap_setfcap"
};

int getbcap(void)
{
int comma=0;
unsigned long i;
int ret;

printf("i know of %d capabilities\n", numcaps);
printf("capability bounding set:");
for (i=0; i<numcaps; i++) {
ret = prctl(PR_CAPBSET_READ, i);
if (ret < 0)
perror("prctl");
else if (ret==1)
printf("%s%s", (comma++) ? ", " : " ", captable[i]);
}
printf("\n");
return 0;
}

int capdrop(char *str)
{
unsigned long i;

int found=0;
for (i=0; i<numcaps; i++) {
if (strcmp(captable[i], str) == 0) {
found=1;
break;
}
}
if (!found)
return 1;
if (prctl(PR_CAPBSET_DROP, i)) {
perror("prctl");
return 1;
}
return 0;
}

int main(int argc, char *argv[])
{
if (argc<2)
return usage(argv[0]);
if (strcmp(argv[1], "get")==0)
return getbcap();
if (strcmp(argv[1], "drop")!=0 || argc<3)
return usage(argv[0]);
if (capdrop(argv[2])) {
printf("unknown capability\n");
return 1;
}
return execl("/bin/bash", "/bin/bash", NULL);
}
************************************************************

[serue@us.ibm.com: fix typo]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>a
Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
Tested-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46c383cc 04-Feb-2008 Andrew Morgan <andrew@nagrom.org>

Remove unnecessary include from include/linux/capability.h

KaiGai Kohei observed that this line in the linux header is not needed.

Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: KaiGai Kohei <kaigai@kaigai.gr.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e338d263 04-Feb-2008 Andrew Morgan <morgan@kernel.org>

Add 64-bit capability support to the kernel

The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities. If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.

FWIW libcap-2.00 supports this change (and earlier capability formats)

http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

[akpm@linux-foundation.org: coding-syle fixes]
[akpm@linux-foundation.org: use get_task_comm()]
[ezk@cs.sunysb.edu: build fix]
[akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
[akpm@linux-foundation.org: unused var]
[serue@us.ibm.com: export __cap_ symbols]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f6936f4 04-Feb-2008 Andrew Morton <akpm@linux-foundation.org>

revert "capabilities: clean up file capability reading"

Revert b68680e4731abbd78863063aaa0dca2a6d8cc723 to make way for the next
patch: "Add 64-bit capability support to the kernel".

We want to keep the vfs_cap_data.data[] structure, using two 'data's for
64-bit caps (and later three for 96-bit caps), whereas
b68680e4731abbd78863063aaa0dca2a6d8cc723 had gotten rid of the 'data' struct
made its members inline.

The 64-bit caps patch keeps the stack abuse fix at get_file_caps(), which was
the more important part of that patch.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b68680e4 21-Oct-2007 Serge E. Hallyn <serue@us.ibm.com>

capabilities: clean up file capability reading

Simplify the vfs_cap_data structure.

Also fix get_file_caps which was declaring
__le32 v1caps[XATTR_CAPS_SZ] on the stack, but
XATTR_CAPS_SZ is already * sizeof(__le32).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72c2d582 18-Oct-2007 Andrew Morgan <morgan@kernel.org>

V3 file capabilities: alter behavior of cap_setpcap

The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.

Since we now have a filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.

The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.

The capabilities of a program are set via the capability convolution
rules:

pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0

at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities are through the pI
capability set.

The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.

Here is how it works:

Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p, can change the value of its own pI to pI' where

(pI' & ~pI) & ~pP = 0.

That is, the only new things in pI' that were not present in pI need to
be present in pP.

The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:

if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}

This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.

One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.

Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5376771 17-Oct-2007 Serge E. Hallyn <serue@us.ibm.com>

Implement file posix capabilities

Implement file posix capabilities. This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php. For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.

Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.

Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.

Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.

Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.

Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().

Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.

Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.

Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.

Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.

One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?

It is a semantic change, as without fsccaps, attach_task doesn't
allow CAP_SYS_NICE to override the uid equivalence check. But since
it uses security_task_setscheduler, which elsewhere is used where
CAP_SYS_NICE can be used to override the uid equivalence check,
fixing it might be tough.

task_setscheduler
note: this also controls cpuset:attach_task. Are we ok with
CAP_SYS_NICE being used to confine to a cpuset?
task_setioprio
task_setnice
sys_setpriority uses this (through set_one_prio) for another
process. Need same checks as setrlimit

Aug 21:
Updated secureexec implementation to reflect the fact that
euid and uid might be the same and nonzero, but the process
might still have elevated caps.

Aug 15:
Handle endianness of xattrs.
Enforce capability version match between kernel and disk.
Enforce that no bits beyond the known max capability are
set, else return -EPERM.
With this extra processing, it may be worth reconsidering
doing all the work at bprm_set_security rather than
d_instantiate.

Aug 10:
Always call getxattr at bprm_set_security, rather than
caching it at d_instantiate.

[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a3021f4 16-Jul-2007 Robert P. J. Day <rpjday@mindspring.com>

Remove unnecessary includes of spinlock.h under include/linux

Remove the obviously unnecessary includes of <linux/spinlock.h> under the
include/linux/ directory, and fix the couple errors that are introduced as
a result of that.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b7add02d 23-May-2007 Andrew Morton <akpm@linux-foundation.org>

capability.h warning fix

include/linux/capability.h:397: warning: "struct task_struct" declared inside parameter list
include/linux/capability.h:397: warning: its scope is only this definition or declaration, which is probably not what you want

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 12b5989b 25-Mar-2006 Chris Wright <chrisw@sous-sol.org>

[PATCH] refactor capable() to one implementation, add __capable() helper

Move capable() to kernel/capability.c and eliminate duplicate
implementations. Add __capable() function which can be used to check for
capabiilty of any process.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c59ede7b 11-Jan-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] move capable() to capability.h

- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
(in include/, block/, ipc/, kernel/, a few drivers/,
mm/, security/, & sound/;
many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bce5f6ba 03-Sep-2005 Martin Hicks <mort@sgi.com>

[PATCH] VM: add capabilites check to set_zone_reclaim

Add a capability check to sys_set_zone_reclaim(). This syscall is not
something that should be available to a user.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!