Cross Reference: /linux-master/security/device

History log of /linux-master/security/device_cgroup.c
Revision	Date	Author	Comments
# 4be22f16	21-Jun-2023	Gaosheng Cui <cuigaosheng1@huawei.com>	device_cgroup: Fix kernel-doc warnings in device_cgroup Fix kernel-doc warnings in device_cgroup: security/device_cgroup.c:835: warning: Excess function parameter 'dev_cgroup' description in 'devcgroup_legacy_check_permission'. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
# 4432b507	24-May-2023	Paul Moore <paul@paul-moore.com>	lsm: fix a number of misspellings A random collection of spelling fixes for source files in the LSM layer. Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
# f89f8e16	03-Mar-2023	Kamalesh Babulal <kamalesh.babulal@oracle.com>	device_cgroup: Fix typo in devcgroup_css_alloc description Fix the stale cgroup.c path in the devcgroup_css_alloc() description. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
# e68bfbd3	25-Oct-2022	Wang Weiyang <wangweiyang2@huawei.com>	device_cgroup: Roll back to original exceptions after copy failure When add the 'a : rwm' entry to devcgroup A's whitelist, at first A's exceptions will be cleaned and A's behavior is changed to DEVCG_DEFAULT_ALLOW. Then parent's exceptions will be copyed to A's whitelist. If copy failure occurs, just return leaving A to grant permissions to all devices. And A may grant more permissions than parent. Backup A's whitelist and recover original exceptions after copy failure. Cc: stable@vger.kernel.org Fixes: 4cef7299b478 ("device_cgroup: add proper checking when changing default behavior") Signed-off-by: Wang Weiyang <wangweiyang2@huawei.com> Reviewed-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
# f10d0596	15-Dec-2021	YiFei Zhu <zhuyifei@google.com>	bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0 for whether the prog array allows or rejects whatever is being hooked. The caller of these macros then return -EPERM or continue processing based on thw macro's return value. Unforunately this is inflexible, since -EPERM is the only err that can be returned. This patch should be a no-op; it prepares for the next patch. The returning of the -EPERM is moved to inside the macros, so the outer functions are directly returning what the macros returned if they are non-zero. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
# aef2feda	15-Dec-2021	Jakub Kicinski <kuba@kernel.org>	add missing bpf-cgroup.h includes We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency, make sure those who actually need more than the definition of struct cgroup_bpf include bpf-cgroup.h explicitly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
# bc62d68e	06-Apr-2020	Amol Grover <frextrite@gmail.com>	device_cgroup: Fix RCU list debugging warning exceptions may be traversed using list_for_each_entry_rcu() outside of an RCU read side critical section BUT under the protection of decgroup_mutex. Hence add the corresponding lockdep expression to fix the following false-positive warning: [ 2.304417] ============================= [ 2.304418] WARNING: suspicious RCU usage [ 2.304420] 5.5.4-stable #17 Tainted: G E [ 2.304422] ----------------------------- [ 2.304424] security/device_cgroup.c:355 RCU-list traversed in non-reader section!! Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: James Morris <jmorris@namei.org>
# eec8fd02	03-Apr-2020	Odin Ugedal <odin@ugedal.com>	device_cgroup: Cleanup cgroup eBPF device filter code Original cgroup v2 eBPF code for filtering device access made it possible to compile with CONFIG_CGROUP_DEVICE=n and still use the eBPF filtering. Change commit 4b7d4d453fc4 ("device_cgroup: Export devcgroup_check_permission") reverted this, making it required to set it to y. Since the device filtering (and all the docs) for cgroup v2 is no longer a "device controller" like it was in v1, someone might compile their kernel with CONFIG_CGROUP_DEVICE=n. Then (for linux 5.5+) the eBPF filter will not be invoked, and all processes will be allowed access to all devices, no matter what the eBPF filter says. Signed-off-by: Odin Ugedal <odin@ugedal.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 4b7d4d45	16-May-2019	Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>	device_cgroup: Export devcgroup_check_permission For AMD compute (amdkfd) driver. All AMD compute devices are exported via single device node /dev/kfd. As a result devices cannot be controlled individually using device cgroup. AMD compute devices will rely on its graphics counterpart that exposes /dev/dri/renderN node for each device. For each task (based on its cgroup), KFD driver will check if /dev/dri/renderN node is accessible before exposing it. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
# da82c92f	27-Jun-2019	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	docs: cgroup-v1: add it to the admin-guide book Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
# 99c8b231	12-Jun-2019	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	docs: cgroup-v1: convert docs to ReST and rename to *.rst Convert the cgroup-v1 files to ReST format, in order to allow a later addition to the admin-guide. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
# 0fcc4c8c	18-Mar-2019	Jann Horn <jannh@google.com>	device_cgroup: fix RCU imbalance in error case When dev_exception_add() returns an error (due to a failed memory allocation), make sure that we move the RCU preemption count back to where it was before we were called. We dropped the RCU read lock inside the loop body, so we can't just "break". sparse complains about this, too: $ make -s C=2 security/device_cgroup.o ./include/linux/rcupdate.h:647:9: warning: context imbalance in 'propagate_exception' - unexpected unlock Fixes: d591fb56618f ("device_cgroup: simplify cgroup tree walk in propagate_exception()") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# ec15872d	08-May-2018	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	docs: fix broken references with multiple hints The script: ./scripts/documentation-file-ref-check --fix Gives multiple hints for broken references on some files. Manually use the one that applies for some files. Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: James Morris <james.morris@microsoft.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Jonathan Corbet <corbet@lwn.net>
# ecf8fecb	05-Nov-2017	Roman Gushchin <guro@fb.com>	device_cgroup: prepare code for bpf-based device controller This is non-functional change to prepare the device cgroup code for adding eBPF-based controller for cgroups v2. The patch performs the following changes: 1) __devcgroup_inode_permission() and devcgroup_inode_mknod() are moving to the device-cgroup.h and converting into static inline. 2) __devcgroup_check_permission() is exported. 3) devcgroup_check_permission() wrapper is introduced to be used by both existing and new bpf-based implementations. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
# 67e306fd	05-Nov-2017	Roman Gushchin <guro@fb.com>	device_cgroup: add DEVCG_ prefix to ACC_* and DEV_* constants Rename device type and access type constants defined in security/device_cgroup.c by adding the DEVCG_ prefix. The reason behind this renaming is to make them global namespace friendly, as they will be moved to the corresponding header file by following patches. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: David S. Miller <davem@davemloft.net> Cc: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
# b2441318	01-Nov-2017	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
# dc3a04d5	02-Sep-2015	Paul E. McKenney <paulmck@kernel.org>	security/device_cgroup: Fix RCU_LOCKDEP_WARN() condition f78f5b90c4ff ("rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()") introduced a bug by incorrectly inverting the condition when moving from rcu_lockdep_assert() to RCU_LOCKDEP_WARN(). This commit therefore fixes the inversion. Reported-by: Felipe Balbi <balbi@ti.com> Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Josh Boyer <jwboyer@fedoraproject.org>
# f78f5b90	18-Jun-2015	Paul E. McKenney <paulmck@kernel.org>	rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN() This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for consistency with the WARN() series of macros. This also requires inverting the sense of the conditional, which this commit also does. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org>
# 5577964e	15-Jul-2014	Tejun Heo <tj@kernel.org>	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
# 7a3bb24f	16-May-2014	Tejun Heo <tj@kernel.org>	device_cgroup: use css_has_online_children() instead of has_children() devcgroup_update_access() wants to know whether there are child cgroups which are online and visible to userland and has_children() may return false positive. Replace it with css_has_online_children(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com>
# 5877019d	16-May-2014	Tejun Heo <tj@kernel.org>	device_cgroup: remove direct access to cgroup->children Currently, devcg::has_children() directly tests cgroup->children for list emptiness. The field is not a published field and scheduled to go away. In addition, the test isn't strictly correct as devcg should only care about children which are visible to userland. This patch converts has_children() to use css_next_child() instead. The subtle incorrectness is noted and will be dealt with later. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com>
# 5c9d535b	16-May-2014	Tejun Heo <tj@kernel.org>	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>
# 451af504	12-May-2014	Tejun Heo <tj@kernel.org>	cgroup: replace cftype->write_string() with cftype->write() Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>
# d2c2b11c	05-May-2014	Aristeu Rozanski <aris@redhat.com>	device_cgroup: check if exception removal is allowed [PATCH v3 1/2] device_cgroup: check if exception removal is allowed When the device cgroup hierarchy was introduced in bd2953ebbb53 - devcg: propagate local changes down the hierarchy a specific case was overlooked. Consider the hierarchy bellow: A default policy: ALLOW, exceptions will deny access \ B default policy: ALLOW, exceptions will deny access There's no need to verify when an new exception is added to B because in this case exceptions will deny access to further devices, which is always fine. Hierarchy in device cgroup only makes sure B won't have more access than A. But when an exception is removed (by writing devices.allow), it isn't checked if the user is in fact removing an inherited exception from A, thus giving more access to B. Example: # echo 'a' >A/devices.allow # echo 'c 1:3 rw' >A/devices.deny # echo $$ >A/B/tasks # echo >/dev/null -bash: /dev/null: Operation not permitted # echo 'c 1:3 w' >A/B/devices.allow # echo >/dev/null # This shouldn't be allowed and this patch fixes it by making sure to never allow exceptions in this case to be removed if the exception is partially or fully present on the parent. v3: missing '*' in function description v2: improved log message and formatting fixes Cc: cgroups@vger.kernel.org Cc: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# f5f3cf6f	24-Apr-2014	Aristeu Rozanski <aris@redhat.com>	device_cgroup: fix the comment format for recently added functions Moving more extensive explanations to the end of the comment. Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 79d71974	20-Apr-2014	Aristeu Rozanski <aris@redhat.com>	device_cgroup: rework device access check and exception checking Whenever a device file is opened and checked against current device cgroup rules, it uses the same function (may_access()) as when a new exception rule is added by writing devices.{allow,deny}. And in both cases, the algorithm is the same, doesn't matter the behavior. First problem is having device access to be considered the same as rule checking. Consider the following structure: A (default behavior: allow, exceptions disallow access) \ B (default behavior: allow, exceptions disallow access) A new exception is added to B by writing devices.deny: c 12:34 rw When checking if that exception is allowed in may_access(): if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) { if (behavior == DEVCG_DEFAULT_ALLOW) { /* the exception will deny access to certain devices / return true; Which is ok, since B is not getting more privileges than A, it doesn't matter and the rule is accepted Now, consider it's a device file open check and the process belongs to cgroup B. The access will be generated as: behavior: allow exception: c 12:34 rw The very same chunk of code will allow it, even if there's an explicit exception telling to do otherwise. A simple test case: # mkdir new_group # cd new_group # echo $$ >tasks # echo "c 1:3 w" >devices.deny # echo >/dev/null # echo $? 0 This is a serious bug and was introduced on c39a2a3018f8 devcg: prepare may_access() for hierarchy support To solve this problem, the device file open function was split from the new exception check. Second problem is how exceptions are processed by may_access(). The first part of the said function tries to match fully with an existing exception: list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) { if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) continue; if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR)) continue; if (ex->major != ~0 && ex->major != refex->major) continue; if (ex->minor != ~0 && ex->minor != refex->minor) continue; if (refex->access & (~ex->access)) continue; match = true; break; } That means the new exception should be contained into an existing one to be considered a match: New exception Existing match? notes b 12:34 rwm b 12:34 rwm yes b 12:34 r b :34 rw yes b 12:34 rw b 12:34 w no extra "r" b :34 rw b 12:34 rw no too broad "" b :34 rw b :34 rwm yes Which is fine in some cases. Consider: A (default behavior: deny, exceptions allow access) \ B (default behavior: deny, exceptions allow access) In this case the full match makes sense, the new exception cannot add more access than the parent allows But this doesn't always work, consider: A (default behavior: allow, exceptions disallow access) \ B (default behavior: deny, exceptions allow access) In this case, a new exception in B shouldn't match any of the exceptions in A, after all you can't allow something that was forbidden by A. But consider this scenario: New exception Existing in A match? outcome b 12:34 rw b 12:34 r no exception is accepted Because the new exception has "w" as extra, it doesn't match, so it'll be added to B's exception list. The same problem can happen during a file access check. Consider a cgroup with allow as default behavior: Access Exception match? b 12:34 rw b 12:34 r no In this case, the access didn't match any of the exceptions in the cgroup, which is required since exceptions will disallow access. To solve this problem, two new functions were created to match an exception either fully or partially. In the example above, a partial check will be performed and it'll produce a match since at least "b 12:34 r" from "b 12:34 rw" access matches. Cc: cgroups@vger.kernel.org Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 4d3bb511	19-Mar-2014	Tejun Heo <tj@kernel.org>	cgroup: drop const from @buffer of cftype->write_string() cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
# 073219e9	08-Feb-2014	Tejun Heo <tj@kernel.org>	cgroup: clean up cgroup_subsys names and initialization cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>
# 2da8ca82	04-Dec-2013	Tejun Heo <tj@kernel.org>	cgroup: replace cftype->read_seq_string() with cftype->seq_show() In preparation of conversion to kernfs, cgroup file handling is updated so that it can be easily mapped to kernfs. This patch replaces cftype->read_seq_string() with cftype->seq_show() which is not limited to single_open() operation and will map directcly to kernfs seq_file interface. The conversions are mechanical. As ->seq_show() doesn't have @css and @cft, the functions which make use of them are converted to use seq_css() and seq_cft() respectively. In several occassions, e.f. if it has seq_string in its name, the function name is updated to fit the new method better. This patch does not introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Neil Horman <nhorman@tuxdriver.com>
# 73ba3534	22-Oct-2013	Serge Hallyn <serge.hallyn@ubuntu.com>	device_cgroup: remove can_attach It is really only wanting to duplicate a check which is already done by the cgroup subsystem. With this patch, user jdoe still cannot move pid 1 into a devices cgroup he owns, but now he can move his own other tasks into devices cgroups. Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com>
# bd8815a6	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: make css_for_each_descendant() and friends include the origin css in the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
# 492eb21b	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
# 182446d0	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: pass around cgroup_subsys_state instead of cgroup in file methods cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
# eb95419b	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
# 63876986	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: add css_parent() Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
# a7c6d554	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: add/update accessors which obtain subsys specific data from css css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
# 8af01f56	08-Aug-2013	Tejun Heo <tj@kernel.org>	cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
# d591fb56	23-May-2013	Tejun Heo <tj@kernel.org>	device_cgroup: simplify cgroup tree walk in propagate_exception() During a config change, propagate_exception() needs to traverse the subtree to update config on the subtree. Because such config updates need to allocate memory, it couldn't directly use cgroup_for_each_descendant_pre() which required the whole iteration to be contained in a single RCU read critical section. To work around the limitation, propagate_exception() built a linked list of descendant cgroups while read-locking RCU and then walked the list afterwards, which is safe as the whole iteration is protected by devcgroup_mutex. This works but is cumbersome. With the recent updates, cgroup iterators now allow dropping RCU read lock while iteration is in progress making this workaround no longer necessary. This patch replaces dev_cgroup->propagate_pending list and get_online_devcg() with direct cgroup_for_each_descendant_pre() walk. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz>
# e57d5cf2	16-Apr-2013	Rami Rosen <ramirose@gmail.com>	devcg: remove parent_cgroup. In devcgroup_css_alloc(), there is no longer need for parent_cgroup. bd2953ebbb("devcg: propagate local changes down the hierarchy") made the variable parent_cgroup redundant. This patch removes parent_cgroup from devcgroup_css_alloc(). Signed-off-by: Rami Rosen <ramirose@gmail.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 8adf12b0	07-Apr-2013	Tejun Heo <tj@kernel.org>	devcg: remove broken_hierarchy tag bd2953ebbb ("devcg: propagate local changes down the hierarchy") implemented proper hierarchy support. Remove the broken tag. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com>
# bd2953eb	15-Feb-2013	Aristeu Rozanski <aris@redhat.com>	devcg: propagate local changes down the hierarchy This patch makes exception changes to propagate down in hierarchy respecting when possible local exceptions. New exceptions allowing additional access to devices won't be propagated, but it'll be possible to add an exception to access all of part of the newly allowed device(s). New exceptions disallowing access to devices will be propagated down and the local group's exceptions will be revalidated for the new situation. Example: A / \ B group behavior exceptions A allow "b 8:* rwm", "c 116:1 rw" B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" If a new exception is added to group A: # echo "c 116:* r" > A/devices.deny it'll propagate down and after revalidating B's local exceptions, the exception "c 116:2 rwm" will be removed. In case parent's exceptions change and local exceptions are not allowed anymore, they'll be deleted. v7: - do not allow behavior change when the cgroup has children - update documentation v6: fixed issues pointed by Serge Hallyn - only copy parent's exceptions while propagating behavior if the local behavior is different - while propagating exceptions, do not clear and copy parent's: it'd be against the premise we don't propagate access to more devices v5: fixed issues pointed by Serge Hallyn - updated documentation - not propagating when an exception is written to devices.allow - when propagating a new behavior, clean the local exceptions list if they're for a different behavior v4: fixed issues pointed by Tejun Heo - separated function to walk the tree and collect valid propagation targets v3: fixed issues pointed by Tejun Heo - update documentation - move css_online/css_offline changes to a new patch - use cgroup_for_each_descendant_pre() instead of own descendant walk - move exception_copy rework to a separared patch - move exception_clean rework to a separated patch v2: fixed issues pointed by Tejun Heo - instead of keeping the local settings that won't apply anymore, remove them Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 1909554c	15-Feb-2013	Aristeu Rozanski <aris@redhat.com>	devcg: use css_online and css_offline Allocate resources and change behavior only when online. This is needed in order to determine if a node is suitable for hierarchy propagation or if it's being removed. Locking: Both functions take devcgroup_mutex to make changes to device_cgroup structure. Hierarchy propagation will also take devcgroup_mutex before walking the tree while walking the tree itself is protected by rcu lock. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# c39a2a30	15-Feb-2013	Aristeu Rozanski <aris@redhat.com>	devcg: prepare may_access() for hierarchy support Currently may_access() is only able to verify if an exception is valid for the current cgroup, which has the same behavior. With hierarchy, it'll be also used to verify if a cgroup local exception is valid towards its cgroup parent, which might have different behavior. v2: - updated patch description - rebased on top of a new patch to expand the may_access() logic to make it more clear - fixed argument description order in may_access() Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 26898fdf	15-Feb-2013	Aristeu Rozanski <aris@redhat.com>	devcg: expand may_access() logic In order to make the next patch more clear, expand may_access() logic. v2: may_access() returns bool now Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 53eb8c82	21-Feb-2013	Jerry Snitselaar <jerry.snitselaar@oracle.com>	device_cgroup: don't grab mutex in rcu callback Commit 103a197c0c4e ("security/device_cgroup: lock assert fails in dev_exception_clean()") grabs devcgroup_mutex to fix assert failure, but a mutex can't be grabbed in rcu callback. Since there shouldn't be any other references when css_free is called, mutex isn't needed for list cleanup in devcgroup_css_free(). Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 103a197c	17-Jan-2013	Jerry Snitselaar <jerry.snitselaar@oracle.com>	security/device_cgroup: lock assert fails in dev_exception_clean() devcgroup_css_free() calls dev_exception_clean() without the devcgroup_mutex being locked. Shutting down a kvm virt was giving me the following trace: [36280.732764] ------------[ cut here ]------------ [36280.732778] WARNING: at /home/snits/dev/linux/security/device_cgroup.c:172 dev_exception_clean+0xa9/0xc0() [36280.732782] Hardware name: Studio XPS 8100 [36280.732785] Modules linked in: xt_REDIRECT fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc nf_conntrack_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter it87 hwmon_vid xt_state nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp snd_seq_device crc32c_intel snd_pcm snd_page_alloc snd_timer snd broadcom tg3 serio_raw i7core_edac edac_core ptp pps_core lpc_ich pcspkr mfd_core soundcore microcode i2c_i801 nfsd auth_rpcgss nfs_acl lockd vhost_net sunrpc tun macvtap macvlan kvm_intel kvm uinput binfmt_misc autofs4 usb_storage firewire_ohci firewire_core crc_itu_t radeon drm_kms_helper ttm [36280.732921] Pid: 933, comm: libvirtd Tainted: G W 3.8.0-rc3-00307-g4c217de #1 [36280.732922] Call Trace: [36280.732927] [<ffffffff81044303>] warn_slowpath_common+0x93/0xc0 [36280.732930] [<ffffffff8104434a>] warn_slowpath_null+0x1a/0x20 [36280.732932] [<ffffffff812deaf9>] dev_exception_clean+0xa9/0xc0 [36280.732934] [<ffffffff812deb2a>] devcgroup_css_free+0x1a/0x30 [36280.732938] [<ffffffff810ccd76>] cgroup_diput+0x76/0x210 [36280.732941] [<ffffffff8119eac0>] d_delete+0x120/0x180 [36280.732943] [<ffffffff81195cff>] vfs_rmdir+0xef/0x130 [36280.732945] [<ffffffff81195e47>] do_rmdir+0x107/0x1c0 [36280.732949] [<ffffffff8132d17e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [36280.732951] [<ffffffff81198646>] sys_rmdir+0x16/0x20 [36280.732954] [<ffffffff8173bd82>] system_call_fastpath+0x16/0x1b [36280.732956] ---[ end trace ca39dced899a7d9f ]--- Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com> Cc: stable@kernel.org Signed-off-by: James Morris <james.l.morris@oracle.com>
# 92fb9748	19-Nov-2012	Tejun Heo <tj@kernel.org>	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
# 4b1c7840	06-Nov-2012	Tejun Heo <tj@kernel.org>	device_cgroup: add lockdep asserts device_cgroup uses RCU safe ->exceptions list which is write-protected by devcgroup_mutex and has had some issues using locking correctly. Add lockdep asserts to utility functions so that future errors can be easily detected. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Li Zefan <lizefan@huawei.com>
# 201e72ac	06-Nov-2012	Tejun Heo <tj@kernel.org>	device_cgroup: fix RCU usage dev_cgroup->exceptions is protected with devcgroup_mutex for writes and RCU for reads; however, RCU usage isn't correct. * dev_exception_clean() doesn't use RCU variant of list_del() and kfree(). The function can race with may_access() and may_access() may end up dereferencing already freed memory. Use list_del_rcu() and kfree_rcu() instead. * may_access() may be called only with RCU read locked but doesn't use RCU safe traversal over ->exceptions. Use list_for_each_entry_rcu(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: stable@vger.kernel.org Cc: Aristeu Rozanski <aris@redhat.com> Cc: Li Zefan <lizefan@huawei.com>
# 64e10477	06-Nov-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: fix unchecked cgroup parent usage In 4cef7299b478687 ("device_cgroup: add proper checking when changing default behavior") the cgroup parent usage is unchecked. root will not have a parent and trying to use device.{allow,deny} will cause problems. For some reason my stressing scripts didn't test the root directory so I didn't catch it on my regular tests. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
# 4cef7299	25-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: add proper checking when changing default behavior Before changing a group's default behavior to ALLOW, we must check if its parent's behavior is also ALLOW. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 26fd8405	25-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: stop using simple_strtoul() Convert the code to use kstrtou32() instead of simple_strtoul() which is deprecated. The real size of the variables are u32, so use kstrtou32 instead of kstrtoul Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 5b7aa7d5	25-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: rename deny_all to behavior This was done in a v2 patch but v1 ended up being committed. The variable name is less confusing and stores the default behavior when no matching exception exists. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 8c9506d1	25-Oct-2012	Jiri Slaby <jirislaby@kernel.org>	cgroup: fix invalid rcu dereference Commit ad676077a2ae ("device_cgroup: convert device_cgroup internally to policy + exceptions") removed rcu locks which are needed in task_devcgroup called in this chain: devcgroup_inode_mknod OR __devcgroup_inode_permission -> __devcgroup_inode_permission -> task_devcgroup -> task_subsys_state -> task_subsys_state_check. Change the code so that task_devcgroup is safely called with rcu read lock held. =============================== [ INFO: suspicious RCU usage. ] 3.6.0-rc5-next-20120913+ #42 Not tainted ------------------------------- include/linux/cgroup.h:553 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by kdevtmpfs/23: #0: (sb_writers){.+.+.+}, at: [<ffffffff8116873f>] mnt_want_write+0x1f/0x50 #1: (&sb->s_type->i_mutex_key#3/1){+.+.+.}, at: [<ffffffff811558af>] kern_path_create+0x7f/0x170 stack backtrace: Pid: 23, comm: kdevtmpfs Not tainted 3.6.0-rc5-next-20120913+ #42 Call Trace: lockdep_rcu_suspicious+0xfd/0x130 devcgroup_inode_mknod+0x19d/0x240 vfs_mknod+0x71/0xf0 handle_create.isra.2+0x72/0x200 devtmpfsd+0x114/0x140 ? handle_create.isra.2+0x200/0x200 kthread+0xd6/0xe0 kernel_thread_helper+0x4/0x10 Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# db9aeca9	04-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: rename whitelist to exception list This patch replaces the "whitelist" usage in the code and comments and replace them by exception list related information. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# ad676077	04-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: convert device_cgroup internally to policy + exceptions The original model of device_cgroup is having a whitelist where all the allowed devices are listed. The problem with this approach is that is impossible to have the case of allowing everything but few devices. The reason for that lies in the way the whitelist is handled internally: since there's only a whitelist, the "all devices" entry would have to be removed and replaced by the entire list of possible devices but the ones that are being denied. Since dev_t is 32 bits long, representing the allowed devices as a bitfield is not memory efficient. This patch replaces the "whitelist" by a "exceptions" list and the default policy is kept as "deny_all" variable in dev_cgroup structure. The current interface determines that whenever "a" is written to devices.allow or devices.deny, the entry masking all devices will be added or removed, respectively. This behavior is kept and it's what will determine the default policy: # cat devices.list a : rwm # echo a >devices.deny # cat devices.list # echo a >devices.allow # cat devices.list a : rwm The interface is also preserved. For example, if one wants to block only access to /dev/null: # ls -l /dev/null crw-rw-rw- 1 root root 1, 3 Jul 24 16:17 /dev/null # echo a >devices.allow # echo "c 1:3 rwm" >devices.deny # cat /dev/null cat: /dev/null: Operation not permitted # echo >/dev/null bash: /dev/null: Operation not permitted mknod /tmp/null c 1 3 mknod: `/tmp/null': Operation not permitted # echo "c 1:3 r" >devices.allow # cat /dev/null # echo >/dev/null bash: /dev/null: Operation not permitted mknod /tmp/null c 1 3 mknod: `/tmp/null': Operation not permitted # echo "c 1:3 rw" >devices.allow # echo >/dev/null # cat /dev/null # mknod /tmp/null c 1 3 mknod: `/tmp/null': Operation not permitted # echo "c 1:3 rwm" >devices.allow # echo >/dev/null # cat /dev/null # mknod /tmp/null c 1 3 # Note that I didn't rename the functions/variables in this patch, but in the next one to make reviewing easier. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 868539a3	04-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: introduce dev_whitelist_clean() This function cleans all the items in a whitelist and will be used by the next patches. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 66b8ef67	04-Oct-2012	Aristeu Rozanski <aris@redhat.com>	device_cgroup: add "deny_all" in dev_cgroup structure deny_all will determine if the default policy is to deny all device access unless for the ones in the exception list. This variable will be used in the next patches to convert device_cgroup internally into a default policy + rules. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 8c7f6edb	13-Sep-2012	Tejun Heo <tj@kernel.org>	cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is deterimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
# 4baf6e33	01-Apr-2012	Tejun Heo <tj@kernel.org>	cgroup: convert all non-memcg controllers to the new cftype interface Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. Termination entry is added to cftype arrays and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations. This is functionally identical transformation. There shouldn't be any visible behavior change. memcg is rather special and will be converted separately. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vivek Goyal <vgoyal@redhat.com>
# 761b3ef5	30-Jan-2012	Li Zefan <lizf@cn.fujitsu.com>	cgroup: remove cgroup_subsys argument from callbacks The argument is not used at all, and it's not necessary, because a specific callback handler of course knows which subsys it belongs to. Now only ->pupulate() takes this argument, because the handlers of this callback always call cgroup_add_file()/cgroup_add_files(). So we reduce a few lines of code, though the shrinking of object size is minimal. 16 files changed, 113 insertions(+), 162 deletions(-) text data bss dec hex filename 5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig 5486170 656987 7039960 13183117 c9288d vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
# 2f7ee569	12-Dec-2011	Tejun Heo <tj@kernel.org>	cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Currently, there's no way to pass multiple tasks to cgroup_subsys methods necessitating the need for separate per-process and per-task methods. This patch introduces cgroup_taskset which can be used to pass multiple tasks and their associated cgroups to cgroup_subsys methods. Three methods - can_attach(), cancel_attach() and attach() - are converted to use cgroup_taskset. This unifies passed parameters so that all methods have access to all information. Conversions in this patchset are identical and don't introduce any behavior change. -v2: documentation updated as per Paul Menage's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: James Morris <jmorris@namei.org>
# 6034f7e6	15-Mar-2011	Lai Jiangshan <laijs@cn.fujitsu.com>	security,rcu: Convert call_rcu(whitelist_item_free) to kfree_rcu() The rcu callback whitelist_item_free() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(whitelist_item_free). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: James Morris <jmorris@namei.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
# 482e0cd3	19-Jun-2011	Al Viro <viro@zeniv.linux.org.uk>	devcgroup_inode_permission: take "is it a device node" checks to inlined wrapper inode_permission() calls devcgroup_inode_permission() and almost all such calls are _not_ for device nodes; let's at least keep the common path straight... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# f780bdb7	26-May-2011	Ben Blum <bblum@andrew.cmu.edu>	cgroups: add per-thread subsystem callbacks Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# c5b60b5e	21-Apr-2010	Justin P. Mattock <justinmattock@gmail.com>	security: whitespace coding style fixes Whitespace coding style fixes. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: James Morris <jmorris@namei.org>
# 5a0e3ad6	24-Mar-2010	Tejun Heo <tj@kernel.org>	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
# be367d09	23-Sep-2009	Ben Blum <bblum@google.com>	cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time Alter the ss->can_attach and ss->attach functions to be able to deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a pre-patch to cgroup-procs-writable.patch.) Currently, new mode of the attach function can only tell the subsystem about the old cgroup of the threadgroup leader. No subsystem currently needs that information for each thread that's being moved, but if one were to be added (for example, one that counts tasks within a group) this bit would need to be reworked a bit to tell the subsystem the right information. [hidave.darkstar@gmail.com: fix build] Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# cd500819	17-Jun-2009	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: skip superfluous checks when found the DEV_ALL elem While walking through the whitelist, if the DEV_ALL item is found, no more check is needed. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# b4046f00	02-Apr-2009	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: avoid using cgroup_lock There is nothing special that has to be protected by cgroup_lock, so introduce devcgroup_mtuex for it's own use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 0b82ac37	07-Jan-2009	Serge E. Hallyn <serue@us.ibm.com>	devices cgroup: allow mkfifo The devcgroup_inode_permission() hook in the devices whitelist cgroup has always bypassed access checks on fifos. But the mknod hook did not. The devices whitelist is only about block and char devices, and fifos can't even be added to the whitelist, so fifos can't be created at all except by tasks which have 'a' in their whitelist (meaning they have access to all devices). Fix the behavior by bypassing access checks to mkfifo. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: James Morris <jmorris@namei.org> Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 116e0575	07-Jan-2009	Lai Jiangshan <laijs@cn.fujitsu.com>	devcgroup: use list_for_each_entry_rcu() We should use list_for_each_entry_rcu in RCU read site. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 47c59803	18-Oct-2008	Lai Jiangshan <laijs@cn.fujitsu.com>	devcgroup: remove spin_lock() Since we introduced rcu for read side, spin_lock is used only for update. But we always hold cgroup_lock() when update, so spin_lock() is not need. Additional cleanup: 1) include linux/rcupdate.h explicitly 2) remove unused variable cur_devcgroup in devcgroup_update_access() Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# c012a54a	18-Oct-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: remove unused variable Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 2cdc7241	18-Oct-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: use kmemdup() This saves 40 bytes on my x86_32 box. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 36fd71d2	02-Sep-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: fix race against rmdir() During the use of a dev_cgroup, we should guarantee the corresponding cgroup won't be deleted (i.e. via rmdir). This can be done through css_get(&dev_cgroup->css), but here we can just get and use the dev_cgroup under rcu_read_lock. And also remove checking NULL dev_cgroup, it won't be NULL since a task always belongs to a cgroup. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 7759fc9d	25-Jul-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: code cleanup - clean up set_majmin() - use simple_strtoul() to parse major/minor [akpm@linux-foundation.org: fix simple_strtoul() usage] [kosaki.motohiro@jp.fujitsu.com: fix warnings] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 4efd1a1b	25-Jul-2008	Pavel Emelyanov <xemul@openvz.org>	devcgroup: relax white-list protection down to RCU Currently this list is protected with a simple spinlock, even for reading from one. This is OK, but can be better. Actually I want it to be better very much, since after replacing the OpenVZ device permissions engine with the cgroup-based one I noticed, that we set 12 default device permissions for each newly created container (for /dev/null, full, terminals, ect devices), and people sometimes have up to 20 perms more, so traversing the ~30-40 elements list under a spinlock doesn't seem very good. Here's the RCU protection for white-list - dev_whitelist_item-s are added and removed under the devcg->lock, but are looked up in permissions checking under the rcu_read_lock. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# f92523e3	25-Jul-2008	Paul Menage <menage@google.com>	cgroup files: convert devcgroup_access_write() into a cgroup write_string() handler This patch converts devcgroup_access_write() from a raw file handler into a handler for the cgroup write_string() method. This allows some boilerplate copying/locking/checking to be removed and simplifies the cleanup path, since these functions are performed by the cgroups framework before calling the handler. Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# ec229e83	13-Jul-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: fix permission check when adding entry to child cgroup # cat devices.list c 1:3 r # echo 'c 1:3 w' > sub/devices.allow # cat sub/devices.list c 1:3 w As illustrated, the parent group has no write permission to /dev/null, so it's child should not be allowed to add this write permission. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 17d213f8	13-Jul-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: always show positive major/minor num # echo "b $((0x7fffffff)):$((0x80000000)) rwm" > devices.allow # cat devices.list b 214748364:-21474836 rwm though a major/minor number of 0x800000000 is meaningless, we should not cast it to a negative value. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# d823f6bf	04-Jul-2008	Li Zefan <lizf@cn.fujitsu.com>	devcgroup: fix odd behaviour when writing 'a' to devices.allow # cat /devcg/devices.list a : rwm # echo a > devices.allow # cat /devcg/devices.list a : rwm a 0:0 rwm This is odd and maybe confusing. With this patch, writing 'a' to devices.allow will add 'a : rwm' to the whitelist. Also a few fixes and updates to the document. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# d1ee2971	05-Jun-2008	Pavel Emelyanov <xemul@openvz.org>	devscgroup: make white list more compact in some cases Consider you added a 'c foo:bar r' permission to some cgroup and then (a bit later) 'c'foo:bar w' for it. After this you'll see the c foo:bar r c foo:bar w lines in a devices.list file. Another example - consider you added 10 'c foo:bar r' permissions to some cgroup (e.g. by mistake). After this you'll see 10 c foo:bar r lines in a list file. This is weird. This situation also has one more annoying consequence. Having many items in a white list makes permissions checking slower, sine it has to walk a longer list. The proposal is to merge permissions for items, that correspond to the same device. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# cc9cb219	05-Jun-2008	Pavel Emelyanov <xemul@openvz.org>	devscgroup: relax task to dev_cgroup conversion Two functions, that need to get a device_cgroup from a task (they are devcgroup_inode_permission and devcgroup_inode_mknod) make it in a strange way: They get a css_set from task, then a subsys_state from css_set, then a cgroup from the state and then a subsys_state again from the cgroup. Besides, the devices_subsys_id is read from memory, whilst there's a enum-ed constant for it. Optimize this part a bit: 1. Get the subsys_stats form the task and be done - no 2 extra dereferences, 2. Use the device_subsys_id constant, not the value from memory (i.e. one less dereference). Found while preparing 2.6.26 OpenVZ port. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# b66862f7	05-Jun-2008	Pavel Emelyanov <xemul@openvz.org>	devcgroup: make a helper to convert cgroup_subsys_state to devs_cgroup This is just picking the container_of out of cgroup_to_devcgroup into a separate function. This new css_to_devcgroup will be used in the 2nd patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 29486df3	29-Apr-2008	Serge E. Hallyn <serue@us.ibm.com>	cgroups: introduce cft->read_seq() Introduce a read_seq() helper in cftype, which uses seq_file to print out lists. Use it in the devices cgroup. Also split devices.allow into two files, so now devices.deny and devices.allow are the ones to use to manipulate the whitelist, while devices.list outputs the cgroup's current whitelist. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 08ce5f16	29-Apr-2008	Serge E. Hallyn <serue@us.ibm.com>	cgroups: implement device whitelist Implement a cgroup to track and enforce open and mknod restrictions on device files. A device cgroup associates a device access whitelist with each cgroup. A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it applies to all types and all major and minor numbers. Major and minor are either an integer or * for all. Access is a composition of r (read), w (write), and m (mknod). The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of the parent. Admins can then remove devices from the whitelist or add new entries. A child cgroup can never receive a device access which is denied its parent. However when a device access is removed from a parent it will not also be removed from the child(ren). An entry is added using devices.allow, and removed using devices.deny. For instance echo 'c 1:3 mr' > /cgroups/1/devices.allow allows cgroup 1 to read and mknod the device usually known as /dev/null. Doing echo a > /cgroups/1/devices.deny will remove the default 'a : mrw' entry. CAP_SYS_ADMIN is needed to change permissions or move another task to a new cgroup. A cgroup may not be granted more permissions than the cgroup's parent has. Any task can move itself between cgroups. This won't be sufficient, but we can decide the best way to adequately restrict movement later. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix may-be-used-uninitialized warning] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: James Morris <jmorris@namei.org> Looks-good-to: Pavel Emelyanov <xemul@openvz.org> Cc: Daniel Hokka Zakrisson <daniel@hozac.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>