History log of /linux-master/ipc/mqueue.c
Revision Date Author Comments
# 6e52a9f0 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_MSGQUEUE on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# a11ddb37 22-May-2021 Varad Gautam <varad.gautam@suse.com>

ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 6521f891 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: prepare for idmapped mounts

The various vfs_*() helpers are called by filesystems or by the vfs
itself to perform core operations such as create, link, mkdir, mknod, rename,
rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
inode is accessed through an idmapped mount map it into the
mount's user namespace and pass it down. Afterwards the checks and
operations are identical to non-idmapped mounts. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 47291baa 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: make permission helpers idmapped mount aware

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# b5f20061 07-May-2020 Oleg Nesterov <oleg@redhat.com>

ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()

Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.

Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().

This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?

Test-case:

#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>

static int notified;

static void sigh(int sig)
{
notified = 1;
}

int main(void)
{
signal(SIGIO, sigh);

int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);

struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);

if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}

wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}

[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43afe4d3 06-Apr-2020 Somala Swaraj <somalaswaraj@gmail.com>

ipc/mqueue.c: fix a brace coding style issue

Signed-off-by: somala swaraj <somalaswaraj@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200301135530.18340-1-somalaswaraj@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5b2cbdb 03-Feb-2020 Manfred Spraul <manfred@colorfullife.com>

ipc/mqueue.c: update/document memory barriers

Update and document memory barriers for mqueue.c:

- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_aquire__after_ctrl_dep() after the READ_ONCE, we need
acquire semantics if the value is STATE_READY.

- use wake_q_add_safe()

- document why __set_current_state() may be used:
Reading task->state cannot happen before the wake_q_add() call,
which happens while holding info->lock. Thus the spin_unlock()
is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3 CPU scenario, if the to be woken
up task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Waiman Long <longman@redhat.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed29f171 03-Feb-2020 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue.c: remove duplicated code

pipelined_send() and pipelined_receive() are identical, so merge them.

[manfred@colorfullife.com: add changelog]
Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c231740d 25-Sep-2019 Markus Elfring <elfring@users.sourceforge.net>

ipc/mqueue: improve exception handling in do_mq_notify()

Null pointers were assigned to local variables in a few cases as exception
handling. The jump target “out” was used where no meaningful data
processing actions should eventually be performed by branches of an if
statement then. Use an additional jump target for calling dev_kfree_skb()
directly.

Return also directly after error conditions were detected when no extra
clean-up is needed by this function implementation.

Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97b0b1ad 25-Sep-2019 Markus Elfring <elfring@users.sourceforge.net>

ipc/mqueue.c: delete an unnecessary check before the macro call dev_kfree_skb()

dev_kfree_skb() input parameter validation, thus the test around the call
is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 533770cc 03-Sep-2019 Al Viro <viro@zeniv.linux.org.uk>

new helper: get_tree_keyed()

For vfs_get_keyed_super users.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a318f12e 16-Jul-2019 Kees Cook <keescook@chromium.org>

ipc/mqueue.c: only perform resource calculation if user valid

Andreas Christoforou reported:

UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

Which could be triggered by:

struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};

if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");

mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".

The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.

Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Andreas Christoforou <andreaschristofo@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 709a643d 12-May-2019 Al Viro <viro@zeniv.linux.org.uk>

mqueue: set ->user_ns before ->get_tree()

... so that we could lift the capability checks into ->get_tree()
caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a5091fda 14-May-2019 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: optimize msg_get()

Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
improve performance of send/recv"). However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course). By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is specially nice
for the rt cases. Furthermore, some callers can call msg_get() in a
loop.

A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game. Passes ltp mq testcases.

Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ecb5821 14-May-2019 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: remove redundant wq task assignment

We already store the current task fo the new waiter before calling
wq_sleep() in both send and recv paths. Trivially remove the redundant
assignment.

Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d6a2946a 14-May-2019 Li Rongqing <lirongqing@baidu.com>

ipc: prevent lockup on alloc_msg and free_msg

msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.

After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Davidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"

Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 015d7956 15-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to ->free_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 935c6912 01-Nov-2018 David Howells <dhowells@redhat.com>

ipc: Convert mqueue fs to fs_context

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

(1) The relevant ipc namespace is selected in when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().

(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.

(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.

Yes, there was a performance gain in not searching the superblock
hash, but it's only paid once per ipc namespace - and only if someone
uses mqueue within that namespace, so I'm not sure it's worth it,
especially as calling sget() allows avoidance of recursion.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8dabe724 06-Jan-2019 Arnd Bergmann <arnd@arndb.de>

y2038: syscalls: rename y2038 compat syscalls

A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# ae7795bc 25-Sep-2018 Eric W. Biederman <ebiederm@xmission.com>

signal: Distinguish between kernel_siginfo and siginfo

Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 9afc5eee 12-Jul-2018 Arnd Bergmann <arnd@arndb.de>

y2038: globally rename compat_time to old_time32

Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# b0d17578 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Enable COMPAT_32BIT_TIME

Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop)
take a timespec argument. After we move 32-bit architectures over to
useing 64-bit time_t based syscalls, we need seperate entry points for
the old 32-bit based interfaces.

This changes the #ifdef guards for the existing 32-bit compat syscalls
to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
enabled on all existing 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 21fc538d 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Use __kernel_timespec

This is a preparatation for changing over __kernel_timespec to 64-bit
times, which involves assigning new system call numbers for mq_timedsend(),
mq_timedreceive() and semtimedop() for compatibility with future y2038
proof user space.

The existing ABIs will remain available through compat code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# cfb2f6f6 24-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

Revert "mqueue: switch to on-demand creation of internal mount"

This reverts commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21.

Aleksa Sarai <asarai@suse.de> writes:
> [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
>
> Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
> which was introduced by 36735a6a2b5e ("mqueue: switch to on-demand
> creation of internal mount").
>
> Basically, the reproducer boils down to being able to mount mqueue if
> you create a new user namespace, even if you don't unshare the IPC
> namespace.
>
> Previously this was not possible, and you would get an -EPERM. The mount
> is the *host* mqueue mount, which is being cached and just returned from
> mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
> it was intentional -- since I'm not familiar with mqueue).
>
> To me it looks like there is a missing permission check. I've included a
> patch below that I've compile-tested, and should block the above case.
> Can someone please tell me if I'm missing something? Is this actually
> safe?
>
> [1]: https://github.com/docker/docker/issues/36674

The issue is a lot deeper than a missing permission check. sb->s_user_ns
was is improperly set as well. So in addition to the filesystem being
mounted when it should not be mounted, so things are not allow that should
be.

We are practically to the release of 4.16 and there is no agreement between
Al Viro and myself on what the code should looks like to fix things properly.
So revert the code to what it was before so that we can take our time
and discuss this properly.

Fixes: 36735a6a2b5e ("mqueue: switch to on-demand creation of internal mount")
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Reported-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# a9a08845 11-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

vfs: do bulk POLL* -> EPOLL* replacement

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68e34f4e 06-Feb-2018 Jonathan Haws <jhaws@sdl.usu.edu>

ipc/mqueue.c: have RT tasks queue in by priority in wq_add()

Previous behavior added tasks to the work queue using the static_prio
value instead of the dynamic priority value in prio. This caused RT tasks
to be added to the work queue in a FIFO manner rather than by priority.
Normal tasks were handled by priority.

This fix utilizes the dynamic priority of the task to ensure that both RT
and normal tasks are added to the work queue in priority order. Utilizing
the dynamic priority (prio) rather than the base priority (normal_prio)
was chosen to ensure that if a task had a boosted priority when it was
added to the work queue, it would be woken sooner to to ensure that it
releases any other locks it may be holding in a more timely manner. It is
understood that the task could have a lower priority when it wakes than
when it was added to the queue in this (unlikely) case.

Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu
Signed-off-by: Jonathan Haws <jhaws@sdl.usu.edu>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# faf1f22b 05-Jan-2018 Eric W. Biederman <ebiederm@xmission.com>

signal: Ensure generic siginfos the kernel sends have all bits initialized

Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 36735a6a 25-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to on-demand creation of internal mount

Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns. What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.

Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a713fd7f 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

tidy do_mq_open() up a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9b20d7fc 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: clean prepare_open() up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 066cc813 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

do_mq_open(): move all work prior to dentry_open() into a helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 05c1b290 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: fold mq_attr_ok() into mqueue_get_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# af4a5372 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

move dentry_open() calls up into do_mq_open()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eecec19d 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9dd95748 02-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

ipc, kernel, mm: annotate ->poll() instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"

SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9047726 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: mqueue: Replace timespec with timespec64

struct timespec is not y2038 safe. Replace
all uses of timespec by y2038 safe struct timespec64.

Even though timespec is used here to represent timeouts,
replace these with timespec64 so that it facilitates
in verification by creating a y2038 safe kernel image
that is free of timespec.

The syscall interfaces themselves are not changed as part
of the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f991af3d 09-Jul-2017 Cong Wang <xiyou.wangcong@gmail.com>

mqueue: fix a use-after-free in sys_mq_notify()

The retry logic for netlink_attachskb() inside sys_mq_notify()
is nasty and vulnerable:

1) The sock refcnt is already released when retry is needed
2) The fd is controllable by user-space because we already
release the file refcnt

so we when retry but the fd has been just closed by user-space
during this small window, we end up calling netlink_detachskb()
on the error path which releases the sock again, later when
the user-space closes this socket a use-after-free could be
triggered.

Setting 'sock' to NULL here should be sufficient to fix it.

Reported-by: GeneBlue <geneblue.mail@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0d060606 27-Jun-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: move compat syscalls to native ones

... and stop messing with compat_alloc_user_space() and friends

[braino fix from Colin King folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8703e8a4 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>

We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 84f001e1 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>

We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# eac0b1c3 27-Feb-2017 Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

ipc/mqueue: add missing sparse annotation

Link: http://lkml.kernel.org/r/20170128235704.45302-1-luc.vanoostenryck@gmail.com
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 194a6b5b 17-Nov-2016 Waiman Long <longman@redhat.com>

sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q

Currently the wake_q data structure is defined by the WAKE_Q() macro.
This macro, however, looks like a function doing something as "wake" is
a verb. Even checkpatch.pl was confused as it reported warnings like

WARNING: Missing a blank line after declarations
#548: FILE: kernel/futex.c:3665:
+ int ret;
+ WAKE_Q(wake_q);

This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
what the macro is doing and eliminates the checkpatch.pl warnings.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 078cd827 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Replace CURRENT_TIME with current_time() for inode timestamps

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a2982cc9 09-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Generalize filesystem nodev handling.

Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 3ee69014 20-May-2016 Eric W. Biederman <ebiederm@xmission.com>

ipc/mqueue: The mqueue filesystem should never contain executables

Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
do not result in executable on mqueuefs by accident.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# d91ee87d 23-May-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Pass data, ns, and ns->userns to mount_ns

Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5d097056 14-Jan-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

kmemcg: account certain kmem allocations to memcg

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de54b9ac 06-Aug-2015 Marcus Gelderie <redmnic@gmail.com>

ipc: modify message queue accounting to not take kernel data structures into account

A while back, the message queue implementation in the kernel was
improved to use btrees to speed up retrieval of messages, in commit
d6629859b36d ("ipc/mqueue: improve performance of send/recv").

That patch introducing the improved kernel handling of message queues
(using btrees) has, as a by-product, changed the meaning of the QSIZE
field in the pseudo-file created for the queue. Before, this field
reflected the size of the user-data in the queue. Since, it also takes
kernel data structures into account. For example, if 13 bytes of user
data are in the queue, on my machine the file reports a size of 61
bytes.

There was some discussion on this topic before (for example
https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael
Kerrisk gave the following background
(https://lkml.org/lkml/2015/6/16/74):

The pseudofiles in the mqueue filesystem (usually mounted at
/dev/mqueue) expose fields with metadata describing a message
queue. One of these fields, QSIZE, as originally implemented,
showed the total number of bytes of user data in all messages in
the message queue, and this feature was documented from the
beginning in the mq_overview(7) page. In 3.5, some other (useful)
work happened to break the user-space API in a couple of places,
including the value exposed via QSIZE, which now includes a measure
of kernel overhead bytes for the queue, a figure that renders QSIZE
useless for its original purpose, since there's no way to deduce
the number of overhead bytes consumed by the implementation.
(The other user-space breakage was subsequently fixed.)

This patch removes the accounting of kernel data structures in the
queue. Reporting the size of these data-structures in the QSIZE field
was a breaking change (see Michael's comment above). Without the QSIZE
field reporting the total size of user-data in the queue, there is no
way to deduce this number.

It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
against the worst-case size of the queue (in both the old and the new
implementation). Therefore, the kernel overhead accounting in QSIZE is
not necessary to help the user understand the limitations RLIMIT imposes
on the processes.

Signed-off-by: Marcus Gelderie <redmnic@gmail.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: John Duffy <jb_duffy@btinternet.com>
Cc: Arto Bendiken <arto@bendiken.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa6004ad 04-May-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: Implement lockless pipelined wakeups

This patch moves the wakeup_process() invocation so it is not done under
the info->lock by making use of a lockless wake_q. With this change, the
waiter is woken up once it is STATE_READY and it does not need to loop
on SMP if it is still in STATE_PENDING. In the timeout case we still need
to grab the info->lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -rt
which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
change if the waiter has a higher priority compared to the waker.

Additionally, this patch micro-optimizes wq_sleep by using the cheaper
cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
matter what, thus get rid of the implied barrier.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: George Spelvin <linux@horizon.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 75c3cfa8 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: assorted weird filesystems: d_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9f45f5bf 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

new helper: audit_file()

... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.

[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6d08a256 07-Apr-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: use device_initcall

... since __initcall is now deprecated.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f3713fd9 25-Feb-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,mqueue: remove limits for the amount of system-wide queues

Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and
locations") added global hardcoded limits to the amount of message
queues that can be created. While these limits are per-namespace,
reality is that it ends up breaking userspace applications.
Historically users have, at least in theory, been able to create up to
INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
for some workloads and use cases. For instance, Madars reports:

"This update imposes bad limits on our multi-process application. As
our app uses approaches that each process opens its own set of queues
(usually something about 3-5 queues per process). In some scenarios
we might run up to 3000 processes or more (which of-course for linux
is not a problem). Thus we might need up to 9000 queues or more. All
processes run under one user."

Other affected users can be found in launchpad bug #1155695:
https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

Instead of increasing this limit, revert it entirely and fallback to the
original way of dealing queue limits -- where once a user's resource
limit is reached, and all memory is used, new queues cannot be created.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Madars Vitolins <m@silodev.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ab08fe2 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: remove braces for single statements

Deal with checkpatch messages:
WARNING: braces {} are not necessary for single statement blocks

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 239521f3 27-Jan-2014 Manfred Spraul <manfred@colorfullife.com>

ipc: whitespace cleanup

The ipc code does not adhere the typical linux coding style.
This patch fixes lots of simple whitespace errors.

- mostly autogenerated by
scripts/checkpatch.pl -f --fix \
--types=pointer_location,spacing,space_before_tab
- one manual fixup (keep structure members tab-aligned)
- removal of additional space_before_tab that were not found by --fix

Tested with some of my msg and sem test apps.

Andrew: Could you include it in -mm and move it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Li Bin <huawei.libin@huawei.com>
Cc: Joe Perches <joe@perches.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b21996e3 20-Sep-2011 J. Bruce Fields <bfields@redhat.com>

locks: break delegations on unlink

We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.

Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:

acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retry

It is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.

The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.

Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 79f6530c 08-Jul-2013 Jeff Layton <jlayton@kernel.org>

audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record

The old audit PATH records for mq_open looked like this:

type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023

...with the audit related changes that went into 3.7, they now look like this:

type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"

Both of these look wrong to me. As Steve Grubb pointed out:

"What we need is 1 PATH record that identifies the MQ. The other PATH
records probably should not be there."

Fix it to record the mq root as a parent, and flag it such that it
should be hidden from view when the names are logged, since the root of
the mq filesystem isn't terribly interesting. With this change, we get
a single PATH record that looks more like this:

type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmpfs_t:s0

In order to do this, a new audit_inode_parent_hidden() function is
added. If we do it this way, then we avoid having the existing callers
of audit_inode needing to do any sort of flag conversion if auditing is
inactive.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a636b702 21-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

ipc: Restrict mounting the mqueue filesystem

Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
rights over the ipc namespace. The principle here is if you create
or have capabilities over it you can mount it, otherwise you get to live
with what other people have mounted.

This information is not particularly sensitive and mqueue essentially
only reports which posix messages queues exist. Still when creating a
restricted environment for an application to live any extra
information may be of use to someone with sufficient creativity. The
historical if imperfect way this information has been restricted has
been not to allow mounts and restricting this to ipc namespace
creators maintains the spirit of the historical restriction.

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 38d78e58 22-Mar-2013 Vladimir Davydov <vdavydov.dev@gmail.com>

mqueue: sys_mq_open: do not call mnt_drop_write() if read-only

mnt_drop_write() must be called only if mnt_want_write() succeeded,
otherwise the mnt_writers counter will diverge.

mnt_writers counters are used to check if remounting FS as read-only is
OK, so after an extra mnt_drop_write() call, it would be impossible to
remount mqueue FS as read-only. Besides, on umount a warning would be
printed like this one:

=====================================
[ BUG: bad unlock balance detected! ]
3.9.0-rc3 #5 Not tainted
-------------------------------------
a.out/12486 is trying to release lock (sb_writers) at:
mnt_drop_write+0x1f/0x30
but there are no more locks to release!

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bc1b69ed 27-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

userns: Allow the unprivileged users to mount mqueue fs

This patch allow the unprivileged user to mount mqueuefs in
user ns.

If two userns share the same ipcns,the files in mqueue fs
should be seen in both these two userns.

If the userns has its own ipcns,it has its own mqueue fs too.
ipcns has already done this job well.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# adb5c247 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

audit: make audit_inode take struct filename

Keep a pointer to the audit_names "slot" in struct filename.

Have all of the audit_inode callers pass a struct filename ponter to
audit_inode instead of a string pointer. If the aname field is already
populated, then we can skip walking the list altogether and just use it
directly.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 91a27b2a 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

vfs: define struct filename and have getname() return it

getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.

For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.

This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.

Later, we'll add other information to the struct as it becomes
convenient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bfcec708 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

audit: set the name_len in audit_inode for parent lookups

Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.

Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.

While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1638113d 08-Oct-2012 Michel Lespinasse <walken@google.com>

ipc/mqueue: remove unnecessary rb_init_node() calls

Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
rbtree of message priorities, and usage of rb_init_node() to initialize
the corresponding nodes. As it turns out, rb_init_node() is unnecessary
here, as the nodes are fully initialized on insertion by rb_link_node()
and the code doesn't access nodes that aren't inserted on the rbtree.

Removing the rb_init_node() calls as I removed that function during
rbtree API cleanups (the only other use of it was in a place that
similarly didn't require it).

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2903ff01 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch simple cases of fget_light to fdget

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 515e0d66 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch mqueue syscalls to fget_light()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 312b90fb 06-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

mqueue: lift mnt_want_write() outside ->i_mutex, clean up a bit

the way it abuses ->d_fsdata still needs to be killed, but that's
a separate story.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 765927b2 26-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

switch dentry_open() to struct path, make it grab references itself

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 312b63fb 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata * to vfs_create()

all we want is a boolean flag, same as the method gets now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ebfc3b49 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata to ->create()

boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ce2d52cc 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: add rbtree node caching support

When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).

This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.

Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.

Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.

Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.

The net result of all of this is:

1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.

2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).

3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.

The performance changes are:

Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns

I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.

As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 113289cc 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: strengthen checks on mqueue creation

We already check the mq attr struct if it's passed in, but now that the
admin can set system wide defaults separate from maximums, it's actually
possible to set the defaults to something that would overflow. So, if
there is no attr struct passed in to the open call, check the default
values.

While we are at it, simplify mq_attr_ok() by making it return 0 or an
error condition, so that way if we add more tests to it later, we have the
option of what error should be returned instead of the calling location
having to pick a possibly inaccurate error code.

[akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c12ea49 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: correct mq_attr_ok test

While working on the other parts of the mqueue stuff, I noticed that the
calculation for overflow in mq_attr_ok didn't actually match reality (this
is especially true since my last patch which changed how we account memory
slightly).

In particular, we used to test for overflow using:
msgs * msgsize + msgs * sizeof(struct msg_msg *)

That was never really correct because each message we allocate via
load_msg() is actually a struct msg_msg followed by the data for the
message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
allocating struct msg_msgseg structs too, but accounting for them would
get really tedious, so let's ignore those...they're only a pointer in size
anyway). This patch updates the calculation to be more accurate in
regards to maximum possible memory consumption by the mqueue.

[akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d6629859 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: improve performance of send/recv

The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.

OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.

"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:

Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.

So, on to the results already:

Subsystem/Test Old New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s

Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s

% slowdown from mqueue
cache thrashing ~8% ~.5%

Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436

Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17

So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.

[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cef0184c 31-May-2012 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mqueue: separate mqueue default value from maximum value

Commit b231cca4381e ("message queues: increase range limits") changed
mqueue default value when attr parameter is specified NULL from hard
coded value to fs.mqueue.{msg,msgsize}_max sysctl value.

This made large side effect. When user need to use two mqueue
applications 1) using !NULL attr parameter and it require big message
size and 2) using NULL attr parameter and only need small size message,
app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
memory size even though it doesn't need.

Doug Ledford propsed to switch back it to static hard coded value.
However it also has a compatibility problem. Some applications might
started depend on the default value is tunable.

The solution is to separate default value from maximum value.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Joe Korty <joe.korty@ccur.com>
Cc: Amerigo Wang <amwang@redhat.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd1f87d2 31-May-2012 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mqueue: don't use kmalloc with KMALLOC_MAX_SIZE

KMALLOC_MAX_SIZE is not a good threshold. It is extremely high and
problematic. Unfortunately, some silly drivers depend on this and we
can't change it. But any new code needn't use such extreme ugly high
order allocations. It brings us awful fragmentation issues and system
slowdown.

Signed-off-by: KOSAKI Motohiro <mkosaki@jp.fujitsu.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Joe Korty <joe.korty@ccur.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b5c4d1a 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: update maximums for the mqueue subsystem

Commit b231cca4381e ("message queues: increase range limits") changed the
maximum size of a message in a message queue from INT_MAX to 8192*128.
Unfortunately, we had customers that relied on a size much larger than
8192*128 on their production systems. After reviewing POSIX, we found
that it is silent on the maximum message size. We did find a couple other
areas in which it was not silent. Fix up the mqueue maximums so that the
customer's system can continue to work, and document both the POSIX and
real world requirements in ipc_namespace.h so that we don't have this
issue crop back up.

Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower
on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
intentionally in place to limit the msg queue depth to one that was small
enough to kmalloc an array of pointers (hence why we divided 128k by
sizeof(long)). If we wish to meet POSIX requirements, we have no choice
but to change our allocation to a vmalloc instead (at least for the large
queue size case). With that, it's possible to increase our allowed
maximum to the POSIX requirements (or more if we choose).

[sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 02967ea0 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: enforce hard limits

In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
In preparation for making more reasonable hard limits, start enforcing
them even on CAP_SYS_RESOURCE.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 858ee378 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: switch back to using non-max values on create

Commit b231cca4381e ("message queues: increase range limits") changed
how we create a queue that does not include an attr struct passed to
open so that it creates the queue with whatever the maximum values are.
However, if the admin has set the maximums to allow flexibility in
creating a queue (aka, both a large size and large queue are allowed,
but combined they create a queue too large for the RLIMIT_MSGQUEUE of
the user), then attempts to create a queue without an attr struct will
fail. Switch back to using acceptable defaults regardless of what the
maximums are.

Note: so far, we only know of a few applications that rely on this
behavior (specifically, set the maximums in /proc, then run the
application which calls mq_open() without passing in an attr struct, and
the application expects the newly created message queue to have the
maximum sizes that were set in /proc used on the mq_open() call, and all
of those applications that we know of are actually part of regression
test suites that were coded to do something like this:

for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
echo $size > /proc/sys/fs/mqueue/msgsize_max
mq_open || echo "Error opening mq with size $size"
done

These test suites that depend on any behavior like this are broken. The
concept that programs should rely upon the system wide maximum in order
to get their desired results instead of simply using a attr struct to
specify what they want is fundamentally unfriendly programming practice
for any multi-tasking OS.

Fixing this will break those few apps that we know of (and those app
authors recognize the brokenness of their code and the need to fix it).
However, the following patch "mqueue: separate mqueue default value"
allows a workaround in the form of new knobs for the default msg queue
creation parameters for any software out there that we don't already
know about that might rely on this behavior at the moment.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dbd5768f 03-May-2012 Jan Kara <jack@suse.cz>

vfs: Rename end_writeback() to clear_inode()

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>


# 76b6db01 14-Mar-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgid

These function are no longer needed replace them with their more useful equivalents.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 6f9ac6d9 16-Nov-2011 Eric W. Biederman <ebiederm@xmission.com>

mqueue: Explicitly capture the user namespace to send the notification to.

Stop relying on user->user_ns which is going away and instead capture
the user_namespace of the process we are supposed to notify.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 48fde701 08-Jan-2012 Al Viro <viro@zeniv.linux.org.uk>

switch open-coded instances of d_make_root() to new helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2a4e64b8 20-Jan-2012 Davidlohr Bueso <dave@gnu.org>

ipc/mqueue: simplify reading msgqueue limit

Because the current task is being used to get the limit, we can simply
use rlimit() instead of task_rlimit().

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6b550f94 10-Jan-2012 Serge E. Hallyn <serge@hallyn.com>

user namespace: make signal.c respect user namespaces

ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)

__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)

do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace

ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.

Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)

Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.

A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.

Changelog:
Nov 18: previous patch missed a leading '&'

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo

There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df0a4283 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch mq_open() to umode_t


# 1b9d5ff7 24-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

mqueue: propagate umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4acdaf27 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->create() to umode_t

vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6b520e05 12-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6f686574 08-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

... and the same kind of leak for mqueue

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 32ea845d 31-Oct-2011 Wanlong Gao <gaowanlong@cn.fujitsu.com>

ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()

Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(),
although it is harmless for the syscall mq_timed* now. It was introduced
by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers").

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d40dcdb0 26-Jul-2011 Jiri Slaby <jirislaby@kernel.org>

ipc/mqueue.c: fix mq_open() return value

We return ENOMEM from mqueue_get_inode even when we have enough memory.
Namely in case the system rlimit of mqueue was reached. This error
propagates to mq_queue and user sees the error unexpectedly. So fix
this up to properly return EMFILE as described in the manpage:

EMFILE The process already has the maximum number of files and
message queues open.

instead of:

ENOMEM Insufficient memory.

With the previous patch we just switch to ERR_PTR/PTR_ERR/IS_ERR error
handling here.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 04715206 26-Jul-2011 Jiri Slaby <jirislaby@kernel.org>

ipc/mqueue.c: refactor failure handling

If new_inode fails to allocate an inode we need only to return with
NULL. But now we test the opposite and have all the work in a nested
block. So do the opposite to save one indentation level (and remove
unnecessary line breaks).

This is only a preparation/cleanup for the next patch where we fix up
return values from mqueue_get_inode.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa0d7e3d 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# ceefda69 26-Jul-2010 Al Viro <viro@zeniv.linux.org.uk>

switch get_sb_ns() users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 85fe4025 23-Oct-2010 Christoph Hellwig <hch@lst.de>

fs: do not assign default i_ino in new_inode

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7de9c6ee 23-Oct-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: ihold()

Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6038f373 15-Aug-2010 Arnd Bergmann <arnd@arndb.de>

llseek: automatically add .llseek fop

All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>


# 6d8af64c 05-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

switch mqueue to ->evict_inode()

... and since the inodes are never hashed, we can use default ->drop_inode()
just fine.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0abbb609 28-May-2010 Al Viro <viro@zeniv.linux.org.uk>

mqueue doesn't need make_bad_inode()

It never hashes them anyway and does final iput() immediately
afterwards. With ->drop_inode() being generic_delete_inode()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a3ed2a15 11-May-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix kernel BUG caused by double free() on mq_open()

In case of aborting because we reach the maximum amount of memory which
can be allocated to message queues per user (RLIMIT_MSGQUEUE), we would
try to free the message area twice when bailing out: first by the error
handling code itself, and then later when cleaning up the inode through
delete_inode().

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ca7d8e6 02-Apr-2010 Carsten Emde <C.Emde@osadl.org>

mqueue: Convert message queue timeout to use hrtimers

The message queue functions mq_timedsend() and mq_timedreceive()
have not yet been converted to use the hrtimer interface.

This patch replaces the call to schedule_timeout() by a call to
schedule_hrtimeout() and transforms the expiration time from
timespec to ktime as required.

[ tglx: Fixed whitespace wreckage ]

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Tested-by: Pradyumna Sampath <pradysam@gmail.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100402204331.715783034@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# f1eb1332 10-Mar-2010 Jiri Slaby <jirislaby@kernel.org>

ipc: use rlimit helpers

Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.

I.e. either use rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2329e392 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix typo "failues" -> "failures"

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8d8ffefa 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: only set error codes if they are really necessary

... postponing assignments until they're needed. Doesn't change code size.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 04db0dde 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: simplify do_open() error handling

It reduces code size:
text data bss dec hex filename
9925 72 16 10013 271d ipc/mqueue-BEFORE.o
9885 72 16 9973 26f5 ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8834cf79 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: apply mathematics distributivity on mq_bytes calculation

Code size reduction:
text data bss dec hex filename
9941 72 16 10029 272d ipc/mqueue-BEFORE.o
9925 72 16 10013 271d ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c8308b1c 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: remove unneeded info->messages initialization

... and abort earlier if we couldn't allocate the message pointers array,
avoiding the u->mq_bytes accounting logic.

It reduces code size:
text data bss dec hex filename
9949 72 16 10037 2735 ipc/mqueue-BEFORE.o
9941 72 16 10029 272d ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4294a8ee 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix mq_open() file descriptor leak on user-space processes

We leak fd on lookup_one_len() failure

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b65a9cfc 16-Dec-2009 Al Viro <viro@zeniv.linux.org.uk>

Untangling ima mess, part 2: deal with counters

* do ima_get_count() in __dentry_open()
* stop doing that in followups
* move ima_path_check() to right after nameidata_to_filp()
* don't bump counters on it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b87221de 21-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: mark remaining super_operations const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46690f37 26-Jun-2009 Mimi Zohar <zohar@linux.vnet.ibm.com>

integrity: ima mq_open imbalance msg fix

This patch fixes an imbalance message as reported by Sanchin Sant.
As we don't need to measure the message queue, just increment the
counters.

Reported-by: Sanchin Sant <sanchinp@in.ibm.com>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# bdc8e5f8 06-Apr-2009 Serge E. Hallyn <serue@us.ibm.com>

namespaces: mqueue namespace: adapt sysctl

Largely inspired from ipc/ipc_sysctl.c. This patch isolates the mqueue
sysctl stuff in its own file.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7eafd7c7 06-Apr-2009 Serge E. Hallyn <serue@us.ibm.com>

namespaces: ipc namespaces: implement support for posix msqueues

Implement multiple mounts of the mqueue file system, and link it to usage
of CLONE_NEWIPC.

Each ipc ns has a corresponding mqueuefs superblock. When a user does
clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
internal mount of a new mqueuefs sb linked to the new ipc ns.

When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
mqueuefs superblock.

Posix message queues can be worked with both through the mq_* system calls
(see mq_overview(7)), and through the VFS through the mqueue mount. Any
usage of mq_open() and friends will work with the acting task's ipc
namespace. Any actions through the VFS will work with the mqueuefs in
which the file was created. So if a user doesn't remount mqueuefs after
unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
/dev/mqueue".

If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
ipc_ns:1 will be freed, (2) it's superblock will live on until task b
umounts the corresponding mqueuefs, and vfs actions will continue to
succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
the deceased ipc_ns:1.

To make this happen, we must protect the ipc reference count when

a) a task exits and drops its ipcns->count, since it might be dropping
it to 0 and freeing the ipcns

b) a task accesses the ipcns through its mqueuefs interface, since it
bumps the ipcns refcount and might race with the last task in the ipcns
exiting.

So the kref is changed to an atomic_t so we can use
atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
through ns = mqueuefs_sb->s_fs_info is protected by the same lock.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 614b84cf 06-Apr-2009 Serge E. Hallyn <serue@us.ibm.com>

namespaces: mqueue ns: move mqueue_mnt into struct ipc_namespace

Move mqueue vfsmount plus a few tunables into the ipc_namespace struct.
The CONFIG_IPC_NS boolean and the ipc_namespace struct will serve both the
posix message queue namespaces and the SYSV ipc namespaces.

The sysctl code will be fixed separately in patch 3. After just this
patch, making a change to posix mqueue tunables always changes the values
in the initial ipc namespace.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce3b0f8d 29-Mar-2009 Al Viro <viro@zeniv.linux.org.uk>

New helper - current_umask()

current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# db1dd4d3 06-Feb-2009 Jonathan Corbet <corbet@lwn.net>

Use f_lock to protect f_flags

Traditionally, changes to struct file->f_flags have been done under BKL
protection, or with no protection at all. This patch causes all f_flags
changes after file open/creation time to be done under protection of
f_lock. This allows the removal of some BKL usage and fixes a number of
longstanding (if microscopic) races.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# c4ea37c2 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 26

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# d5460c99 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 25

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 2ed7c03e 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] Convert all system calls to return a long

Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# a6684999 07-Jan-2009 Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

mqueue: fix si_pid value in mqueue do_notify()

If a process registers for asynchronous notification on a POSIX message
queue, it gets a signal and a siginfo_t structure when a message arrives
on the message queue. The si_pid in the siginfo_t structure is set to the
PID of the process that sent the message to the message queue.

The principle is the following:
. when mq_notify(SIGEV_SIGNAL) is called, the caller registers for
notification when a msg arrives. The associated pid structure is stroed into
inode_info->notify_owner. Let's call this process P1.
. when mq_send() is called by say P2, P2 sends a signal to P1 to notify
him about msg arrival.

The way .si_pid is set today is not correct, since it doesn't take into account
the fact that the process that is sending the message might not be in the
same namespace as the notified one.

This patch proposes to set si_pid to the sender's pid into the notify_owner
namespace.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Bastian Blank <bastian@waldi.eu.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56ff5efa 09-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

zero i_uid/i_gid on inode allocation

... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.

i_mode is not worth the effort; it has no common default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 564f6993 14-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

sanitize audit_mq_open()

* don't bother with allocations
* don't do double copy_from_user()
* don't duplicate parts of check for audit_dummy_context()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c32c8af4 14-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

sanitize AUDIT_MQ_SENDRECV

* logging the original value of *msg_prio in mq_timedreceive(2)
is insane - the argument is write-only (i.e. syscall always
ignores the original value and only overwrites it).
* merge __audit_mq_timed{send,receive}
* don't do copy_from_user() twice
* don't mess with allocations in auditsc part
* ... and don't bother checking !audit_enabled and !context in there -
we'd already checked for audit_dummy_context().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 20114f71 10-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

sanitize audit_mq_notify()

* don't copy_from_user() twice
* don't bother with allocations
* don't duplicate parts of audit_dummy_context()
* make it return void

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7392906e 10-Dec-2008 Al Viro <viro@zeniv.linux.org.uk>

sanitize audit_mq_getsetattr()

* get rid of allocations
* make it return void
* don't duplicate parts of audit_dummy_context()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 745ca247 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Pass credentials through dentry_open()

Pass credentials through dentry_open() so that the COW creds patch can have
SELinux's flush_unauthorized_files() pass the appropriate creds back to itself
when it opens its null chardev.

The security_dentry_open() call also now takes a creds pointer, as does the
dentry_open hook in struct security_operations.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>


# 86a264ab 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap current->cred and a few other accessors

Wrap current->cred and a few other accessors to hide their actual
implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b6dff3ec 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 414c0708 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap task credential accesses in the SYSV IPC subsystem

Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b231cca4 18-Oct-2008 Joe Korty <joe.korty@ccur.com>

message queues: increase range limits

Increase the range of various posix message queue limits.

Posix gives the message queue user the ability to 'trade off' the maximum
size of messages with the number of possible messages that can be 'in
flight'. Linux currently makes this trade off more restrictive than it
needs to be.

In particular, the maximum message size today can be made no smaller than
8192. This greatly restricts those applications that would like to have
the ability to post large numbers of very small messages.

So this task lowers the limit that the maximum message size can be set to,
from 8192 to 128. It also lowers the limit that the maximum #number of
messages in flight can be set to, from 10 to 1.

With these changes the message queue user can make better trade offs
between #messages and message size, in order to get everything to fit
within the setrlimit(RLIMIT_MSGQUEUE) limit for that particular user.

This patch also applies the values in

/proc/sys/fs/mqueue/msg_max
/proc/sys/fs/mqueue/msgsize_max

as the defaults for the max #messages allowed and the max message size
allowed, respectively, for those applications that do not supply these.
Previously, the defaults were hardwired to 10 and 8192, respectively.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f419a2e3 21-Jul-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] kill nameidata passing to permission(), rename to inode_permission()

Incidentally, the name that gives hundreds of false positives on grep
is not a good idea...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 51cc5068 25-Jul-2008 Alexey Dobriyan <adobriyan@gmail.com>

SL*B: drop kmem cache argument from constructor

Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f1a43f93 25-Jul-2008 Akinobu Mita <akinobu.mita@gmail.com>

ipc: use simple_read_from_buffer()

Also this patch kills unneccesary trailing NULL character.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9457afee 05-Jun-2008 Denis V. Lunev <den@openvz.org>

netlink: Remove nonblock parameter from netlink_attachskb

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 269f2134 03-May-2008 Ulrich Drepper <drepper@redhat.com>

tiny mq_open optimization

A very small cleanup for mq_open.

We do not have to call set_close_on_exit if we create the file
descriptor right away with the flag set. We have a function for this
now. The resulting code is smaller and a tiny bit faster.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a3fd211 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: elevate write count for open()s

This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.

We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.

There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0622753b 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: elevate write count for rmdir and unlink.

Elevate the write count during the vfs_rmdir() and vfs_unlink().

[AV: merged rmdir and unlink parts, added missing pieces in nfsd]

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 56496c1d 08-Feb-2008 Pavel Emelyanov <xemul@openvz.org>

Pidns: fix badly converted mqueues pid handling

When sending the pid namespaces patches I wrongly converted the tsk->tgid into
task_pid_vnr(tsk) in mqueue-s (the git id of this patch is
b488893a390edfe027bae7a46e9af8083e740668).

The proper behavior is to get the task_tgid_vnr(tsk).

This seem to be the only mistake of that kind.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6c5f3e7b 08-Feb-2008 Pavel Emelyanov <xemul@openvz.org>

Pidns: make full use of xxx_vnr() calls

Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace. After this each call
like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
one.

Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd79b771 28-Nov-2007 Pavel Emelyanov <xemul@openvz.org>

ipc: lost unlock and fput in mqueue.c on error path

The error path in sys_mq_getsetattr() after the call to
audit_mq_getsetattr() is wrong - the info->lock is not unlocked and the
struct file *filp is not put.

Fix them both.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3d8d1e3 07-Nov-2007 Patrick McHardy <kaber@trash.net>

[NETLINK]: Fix unicast timeouts

Commit ed6dcf4a in the history.git tree broke netlink_unicast timeouts
by moving the schedule_timeout() call to a new function that doesn't
propagate the remaining timeout back to the caller. This means on each
retry we start with the full timeout again.

ipc/mqueue.c seems to actually want to wait indefinitely so this
behaviour is retained.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a190ae6 06-Jun-2007 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] pass dentry to audit_inode()/audit_inode_child()

makes caller simpler *and* allows to scan ancestors

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b488893a 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

pid namespaces: changes to show virtual ids to user

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97aeacf4 18-Oct-2007 Eric W. Biederman <ebiederm@xmission.com>

sysctl mqueue: remove the binary sysctl numbers

Because of a conflict with FS_INODE_NR none of the binary sysctl numbers use
by mqueue, were available to user space. So just remove them.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ba9b9d0 17-Oct-2007 Christoph Lameter <clameter@sgi.com>

Slab API: remove useless ctor parameter and reorder parameters

Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.

Convert

ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ee015e0 10-Oct-2007 Denis V. Lunev <den@openvz.org>

[NET]: cleanup 3rd argument in netlink_sendskb

netlink_sendskb does not use third argument. Clean it and save a couple of
bytes.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 20c2df83 19-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>


# a35afb83 16-May-2007 Christoph Lameter <clameter@sgi.com>

Remove SLAB_CTOR_CONSTRUCTOR

SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4fc03b9b 13-Feb-2007 Amy Griffis <amy.griffis@hp.com>

[PATCH] complete message queue auditing

Handle the edge cases for POSIX message queue auditing. Collect inode
info when opening an existing mq, and for send/receive operations. Remove
audit_inode_update() as it has really evolved into the equivalent of
audit_inode().

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 50953fe9 06-May-2007 Christoph Lameter <clameter@sgi.com>

slab allocators: Remove SLAB_DEBUG_INITIAL flag

I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again? The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free. That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on. If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code. But there is no such code
in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e. add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches. Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support. Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a434814 06-Mar-2007 Peter Zijlstra <a.p.zijlstra@chello.nl>

[PATCH] mqueue: nested locking annotation

Fix http://bugzilla.kernel.org/show_bug.cgi?id=8130

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b4d4147 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 92e1d5be 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct inode_operations const 2

Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9a32144e 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com>

[PATCH] mark struct file_operations const 7

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d63079a 08-Dec-2006 Josef Sipek <jsipek@fsl.cs.sunysb.edu>

[PATCH] struct path: convert ipc

Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e18b890b 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e94b1766 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove SLAB_KERNEL

SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f66e928b 03-Oct-2006 Michal Wronski <michal.wronski@gmail.com>

Michal Wronski: update contact info

My email has changed.

Signed-Off-By: Michal Wronski <michal.wronski@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# a03fcb73 02-Oct-2006 Cedric Le Goater <clg@fr.ibm.com>

[PATCH] update mq_notify to use a struct pid

Message queues can signal a process waiting for a message.

This patch replaces the pid_t value with a struct pid to avoid pid wrap
around problems.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d8c76e6f 01-Oct-2006 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mount prepwork: inc_nlink() helper

This is mostly included for parity with dec_nlink(), where we will have some
more hooks. This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 9a53c3a7 01-Oct-2006 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: unlink: monitor i_nlink

When a filesystem decrements i_nlink to zero, it means that a write must be
performed in order to drop the inode from the filesystem.

We're shortly going to have keep filesystems from being remounted r/o between
the time that this i_nlink decrement and that write occurs.

So, add a little helper function to do the decrements. We'll tie into it in a
bit to note when i_nlink hits zero.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ba52de12 27-Sep-2006 Theodore Ts'o <tytso@mit.edu>

[PATCH] inode-diet: Eliminate i_blksize from the inode structure

This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1a1d92c1 27-Sep-2006 Alexey Dobriyan <adobriyan@gmail.com>

[PATCH] Really ignore kmem_cache_destroy return value

* Rougly half of callers already do it by not checking return value
* Code in drivers/acpi/osl.c does the following to be sure:

(void)kmem_cache_destroy(cache);

* Those who check it printk something, however, slab_error already printed
the name of failed cache.
* XFS BUGs on failed kmem_cache_destroy which is not the decision
low-level filesystem driver should make. Converted to ignore.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 75e1fcc0 23-Jun-2006 Miklos Szeredi <miklos@szeredi.hu>

[PATCH] vfs: add lock owner argument to flush operation

Pass the POSIX lock owner ID to the flush operation.

This is useful for filesystems which don't want to store any locking state
in inode->i_flock but want to handle locking/unlocking POSIX locks
internally. FUSE is one such filesystem but I think it possible that some
network filesystems would need this also.

Also add a flag to indicate that a POSIX locking request was generated by
close(), so filesystems using the above feature won't send an extra locking
request in this case.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 454e2398 23-Jun-2006 David Howells <dhowells@redhat.com>

[PATCH] VFS: Permit filesystem to override root dentry on mount

Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.

(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().

(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().

This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.

However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.

[*] Anonymous until discovered from another tree.

(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 20ca73bc 24-May-2006 George C. Wilson <ltcgcw@us.ibm.com>

[PATCH] Audit of POSIX Message Queue Syscalls v.2

This patch adds audit support to POSIX message queues. It applies cleanly to
the lspp.b15 branch of Al Viro's git tree. There are new auxiliary data
structures, and collection and emission routines in kernel/auditsc.c. New hooks
in ipc/mqueue.c collect arguments from the syscalls.

I tested the patch by building the examples from the POSIX MQ library tarball.
Build them -lrt, not against the old MQ library in the tarball. Here's the URL:
http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz
Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive,
mq_notify, mq_getsetattr. mq_unlink has no new hooks. Please see the
corresponding userspace patch to get correct output from auditd for the new
record types.

[fixes folded]

Signed-off-by: George Wilson <ltcgcw@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7b7a317c 28-Mar-2006 Serge E. Hallyn <serue@us.ibm.com>

[PATCH] mqueue comment typo fix

(akpm: I don't do comment typos patches. This one snuck through by accident)

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 64bc0430 25-Mar-2006 Manfred Spraul <manfred@dbl.q-ag.de>

[PATCH] one ipc/sem.c->mutex.c converstion too many..

Ingo's sem2mutex patch incorrectly replaced one reference to ipc/sem.c
with ipc/mutex.c in a comment.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5f921ae9 26-Mar-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] sem2mutex: ipc, id.sem

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a609164f 21-Mar-2006 Michal Wronski <Michal.Wronski@motorola.com>

Remove superfluous NOTIFY_COOKIE_LEN define

NOTIFY_COOKIE_LEN is defined in mqueue.h as well as mqueue.c
This patch removes redundant definition from mqueue.c

Signed-off-by: Michal Wronski <Michal.Wronski@motorola.com>
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# a70ea994 09-Feb-2006 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>

[NETLINK]: Fix a severe bug

netlink overrun was broken while improvement of netlink.
Destination socket is used in the place where it was meant to be source socket,
so that now overrun is never sent to user netlink sockets, when it should be,
and it even can be set on kernel socket, which results in complete deadlock
of rtnetlink.

Suggested fix is to restore status quo passing source socket as additional
argument to netlink_attachskb().

A little explanation: overrun is set on a socket, when it failed
to receive some message and sender of this messages does not or even
have no way to handle this error. This happens in two cases:
1. when kernel sends something. Kernel never retransmits and cannot
wait for buffer space.
2. when user sends a broadcast and the message was not delivered
to some recipients.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7c7dce92 14-Jan-2006 Alexander Viro <aviro@redhat.com>

[PATCH] Fix double decrement of mqueue_mnt->mnt_count in sys_mq_open

Fixed the refcounting on failure exits in sys_mq_open() and
cleaned the logics up. Rules are actually pretty simple - dentry_open()
expects vfsmount and dentry to be pinned down and it either transfers
them into created struct file or drops them. Old code had been very
confused in that area - if dentry_open() had failed either in do_open()
or do_create(), we ended up dentry and mqueue_mnt dropped twice, once
by dentry_open() cleanup and then by sys_mq_open().

Fix consists of making the rules for do_create() and do_open()
same as for dentry_open() and updating the sys_mq_open() accordingly;
that actually leads to more straightforward code and less work on
normal path.

Signed-off-by: Al Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c59ede7b 11-Jan-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] move capable() to capability.h

- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
(in include/, block/, ipc/, kernel/, a few drivers/,
mm/, security/, & sound/;
many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1b1dcc1b 09-Jan-2006 Jes Sorensen <jes@sgi.com>

[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem

This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.

Modified-by: Ingo Molnar <mingo@elte.hu>

(finished the conversion)

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 65163fd7 06-Nov-2005 Michal Wronski <Michal.Wronski@motorola.com>

Update Michal Wronski contact info


# 59175839 27-Sep-2005 Krzysztof Benedyczak <golbi@mat.uni.torun.pl>

[PATCH] Make POSIX message queue sys_mq_open() honor umask

We ignored umask when creating new queues via mq_open (when creating
with open() on mqueue fs it is ok of course). According to the
specification this a bug. This trivial patch fixes this.

Signed-off-by: Krzysztof Benedyczak <golbi@mat.uni.torun.pl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 338cec32 10-Sep-2005 Adrian Bunk <bunk@stusta.de>

[PATCH] merge some from Rusty's trivial patches

This patch contains the most trivial from Rusty's trivial patches:
- spelling fixes
- remove duplicate includes

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7ed20e1a 01-May-2005 Jesper Juhl <juhl-lkml@dif.dk>

[PATCH] convert that currently tests _NSIG directly to use valid_signal()

Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d59dd462 01-May-2005 akpm@osdl.org <akpm@osdl.org>

[PATCH] use smp_mb/wmb/rmb where possible

Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!