History log of /linux-master/ipc/sem.c
Revision Date Author Comments
# 6a4746ba 11-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

ipc: remove memcg accounting for sops objects in do_semtimedop()

Linus proposes to revert an accounting for sops objects in
do_semtimedop() because it's really just a temporary buffer
for a single semtimedop() system call.

This object can consume up to 2 pages, syscall is sleeping
one, size and duration can be controlled by user, and this
allocation can be repeated by many thread at the same time.

However Shakeel Butt pointed that there are much more popular
objects with the same life time and similar memory
consumption, the accounting of which was decided to be
rejected for performance reasons.

Considering at least 2 pages for task_struct and 2 pages for
the kernel stack, a back of the envelope calculation gives a
footprint amplification of <1.5 so this temporal buffer can be
safely ignored.

The factor would IMO be interesting if it was >> 2 (from the
PoV of excessive (ab)use, fine-grained accounting seems to be
currently unfeasible due to performance impact).

Link: https://lore.kernel.org/lkml/90e254df-0dfe-f080-011e-b7c53ee7fd20@virtuozzo.com/
Fixes: 18319498fdd4 ("memcg: enable accounting of ipc resources")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bdec0145 11-Aug-2021 Arnd Bergmann <arnd@arndb.de>

ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation

sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To
remove this one, expose the internal code of the actual implementation
that operates on a kernel pointer and call it directly after copying.

There should be no measurable impact on the normal execution of this
function, and it makes the overly long function a little shorter, which
may help readability.

While reworking the oabi version, make it behave a little more like
the native one, using kvmalloc_array() and restructure the code
flow in a similar way.

The naming of __do_semtimedop() is not very good, I hope someone can
come up with a better name.

One regression was spotted by kernel test robot <rong.a.chen@intel.com>
and fixed before the first mailing list submission.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>


# 18319498 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting of ipc resources

When user creates IPC objects it forces kernel to allocate memory for
these long-living objects.

It makes sense to account them to restrict the host's memory consumption
from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages
semaphores and semaphore's undo lists.

Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 17d056e0 30-Jun-2021 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock

The patch solves three weaknesses in ipc/sem.c:

1) The initial read of use_global_lock in sem_lock() is an intentional
race. KCSAN detects these accesses and prints a warning.

2) The code assumes that plain C read/writes are not mangled by the CPU
or the compiler.

3) The comment it sysvipc_sem_proc_show() was hard to understand: The
rest of the comments in ipc/sem.c speaks about sem_perm.lock, and
suddenly this function speaks about ipc_lock_object().

To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used
in code that owns sma->sem_perm.lock.

The comment is updated to solve 3)

[manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock]
Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com

Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc37a3b8 30-Jun-2021 Vasily Averin <vvs@virtuozzo.com>

ipc sem: use kvmalloc for sem_undo allocation

Patch series "ipc: allocations cleanup", v2.

Some ipc objects use the wrong allocation functions: small objects can use
kmalloc(), and vice versa, potentially large objects can use kmalloc().

This patch (of 2):

Size of sem_undo can exceed one page and with the maximum possible nsems =
32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
avoid user-triggered disruptive actions like OOM killer in case of
high-order memory shortage.

User triggerable high order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.

Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com
Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a11ddb37 22-May-2021 Varad Gautam <varad.gautam@suse.com>

ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7497835f 06-May-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ipc/sem.c: spelling fix

s/purpuse/purpose/

Link: https://lkml.kernel.org/r/20210319221432.26631-1-unixbhaskar@gmail.com
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b1989a3d 06-May-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ipc/sem.c: mundane typo fixes

s/runtine/runtime/
s/AQUIRE/ACQUIRE/
s/seperately/separately/
s/wont/won\'t/
s/succesfull/successful/

Link: https://lkml.kernel.org/r/20210326022240.26375-1-unixbhaskar@gmail.com
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 00898e85 11-Aug-2020 Alexey Dobriyan <adobriyan@gmail.com>

ipc: uninline functions

Two functions are only called via function pointers, don't bother
inlining them.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# edf28f40 20-Feb-2020 Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>

Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"

This reverts commit a97955844807e327df11aa33869009d14d6b7de0.

Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
in exit_sem()") removes a lock that is needed. This leads to a process
looping infinitely in exit_sem() and can also lead to a crash. There is
a reproducer available in [1] and with the commit reverted the issue
does not reproduce anymore.

Using the reproducer found in [1] is fairly easy to reach a point where
one of the child processes is looping infinitely in exit_sem between
for(;;) and if (semid == -1) block, while it's trying to free its last
sem_undo structure which has already been freed by freeary().

Each sem_undo struct is on two lists: one per semaphore set (list_id)
and one per process (list_proc). The list_id list tracks undos by
semaphore set, and the list_proc by process.

Undo structures are removed either by freeary() or by exit_sem(). The
freeary function is invoked when the user invokes a syscall to remove a
semaphore set. During this operation freeary() traverses the list_id
associated with the semaphore set and removes the undo structures from
both the list_id and list_proc lists.

For this case, exit_sem() is called at process exit. Each process
contains a struct sem_undo_list (referred to as "ulp") which contains
the head for the list_proc list. When the process exits, exit_sem()
traverses this list to remove each sem_undo struct. As in freeary(),
whenever a sem_undo struct is removed from list_proc, it is also removed
from the list_id list.

Removing elements from list_id is safe for both exit_sem() and freeary()
due to sem_lock(). Removing elements from list_proc is not safe;
freeary() locks &un->ulp->lock when it performs
list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
lock usage in exit_sem()").

This can result in the following situation while executing the
reproducer [1] : Consider a child process in exit_sem() and the parent
in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).

- The list_proc for the child contains the last two undo structs A and
B (the rest have been removed either by exit_sem() or freeary()).

- The semid for A is 1 and semid for B is 2.

- exit_sem() removes A and at the same time freeary() removes B.

- Since A and B have different semid sem_lock() will acquire different
locks for each process and both can proceed.

The bug is that they remove A and B from the same list_proc at the same
time because only freeary() acquires the ulp lock. When exit_sem()
removes A it makes ulp->list_proc.next to point at B and at the same
time freeary() removes B setting B->semid=-1.

At the next iteration of for(;;) loop exit_sem() will try to remove B.

The only way to break from for(;;) is for (&un->list_proc ==
&ulp->list_proc) to be true which is not. Then exit_sem() will check if
B->semid=-1 which is and will continue looping in for(;;) until the
memory for B is reallocated and the value at B->semid is changed.

At that point, exit_sem() will crash attempting to unlink B from the
lists (this can be easily triggered by running the reproducer [1] a
second time).

To prove this scenario instrumentation was added to keep information
about each sem_undo (un) struct that is removed per process and per
semaphore set (sma).

CPU0 CPU1
[caller holds sem_lock(sma for A)] ...
freeary() exit_sem()
... ...
... sem_lock(sma for B)
spin_lock(A->ulp->lock) ...
list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc)

Undo structures A and B have different semid and sem_lock() operations
proceed. However they belong to the same list_proc list and they are
removed at the same time. This results into ulp->list_proc.next
pointing to the address of B which is already removed.

After reverting commit a97955844807 ("ipc,sem: remove uneeded
sem_undo_list lock usage in exit_sem()") the issue was no longer
reproducible.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779

Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
Signed-off-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Herton R. Krzesinski <herton@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <malat@debian.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8116b54e 03-Feb-2020 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: document and update memory barriers

Document and update the memory barriers in ipc/sem.c:

- Add smp_store_release() to wake_up_sem_queue_prepare() and
document why it is needed.

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
regarding locking.

- Switch to using wake_q_add_safe().

Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 984035ad 25-Sep-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

ipc/sem.c: convert to use built-in RCU list checking

CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep
expression if using srcu or locking for protection. It can only check
regular RCU protection, all other protection needs to be passed as lockdep
expression.

Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Cc: Jonathan Derrick <jonathan.derrick@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4a2ae929 07-Mar-2019 Gustavo A. R. Silva <gustavo@embeddedor.com>

ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size

Use kvzalloc() instead of kvmalloc() and memset().

Also, make use of the struct_size() helper instead of the open-coded
version in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20190131214221.GA28930@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 667da6a2 07-Mar-2019 Mathieu Malaterre <malat@debian.org>

ipc: annotate implicit fall through

There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).

This commit remove the following warning:

ipc/sem.c:1683:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203608.18218-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8dabe724 06-Jan-2019 Arnd Bergmann <arnd@arndb.de>

y2038: syscalls: rename y2038 compat syscalls

A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 275f2214 31-Dec-2018 Arnd Bergmann <arnd@arndb.de>

ipc: rename old-style shmctl/semctl/msgctl syscalls

The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.

For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.

The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.

As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.

A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 9afc5eee 12-Jul-2018 Arnd Bergmann <arnd@arndb.de>

y2038: globally rename compat_time to old_time32

Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 27c331a1 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: further variable name cleanups

The varable names got a mess, thus standardize them again:

id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree

Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eae04d25 21-Aug-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc: simplify ipc initialization

Now that we know that rhashtable_init() will not fail, we can get rid of a
lot of the unnecessary cleanup paths when the call errored out.

[manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning]
Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4241c1a3 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc: rename ipcctl_pre_down_nolock()

Both the comment and the name of ipcctl_pre_down_nolock() are misleading:
The function must be called while holdling the rw semaphore.

Therefore the patch renames the function to ipcctl_obtain_check(): This
name matches the other names used in util.c:

- "obtain" function look up a pointer in the idr, without
acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 39cfffd7 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()

ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early, and by
modifying all callers.

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with reading kern_ipc_perm.seq, here both read and write to already
released memory could happen.

Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 615c999c 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc: compute kern_ipc_perm.id under the ipc lock

ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use this
unitialized value as the return code.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with kern_ipc_perm.seq

Link: http://lkml.kernel.org/r/20180712185241.4017-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f075faa3 26-Jul-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem.c: prevent queue.status tearing in semop

In order for load/store tearing prevention to work, _all_ accesses to
the variable in question need to be done around READ and WRITE_ONCE()
macros. Ensure everyone does so for q->status variable for
semtimedop().

Link: http://lkml.kernel.org/r/20180717052654.676-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0eb71a9d 17-Jun-2018 NeilBrown <neilb@suse.com>

rhashtable: split rhashtable.h

Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code. rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec67aaa4 14-Jun-2018 Davidlohr Bueso <dave@stgolabs.net>

sysvipc/sem: mitigate semnum index against spectre v1

Both smatch and coverity are reporting potential issues with spectre
variant 1 with the 'semnum' index within the sma->sems array, ie:

ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems'
ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems'
ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems'

Avoid any possible speculation by using array_index_nospec() thus
ensuring the semnum value is bounded to [0, sma->sem_nsems). With the
exception of sem_lock() all of these are slowpaths.

Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 344476e1 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kvmalloc() -> kvmalloc_array()

The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
patch replaces cases of:

kvmalloc(a * b, gfp)

with:
kvmalloc_array(a * b, gfp)

as well as handling cases of:

kvmalloc(a * b * c, gfp)

with:

kvmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kvmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kvmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kvmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kvmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvmalloc
+ kvmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kvmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kvmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kvmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kvmalloc(C1 * C2 * C3, ...)
|
kvmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kvmalloc(sizeof(THING) * C2, ...)
|
kvmalloc(sizeof(TYPE) * C2, ...)
|
kvmalloc(C1 * C2 * C3, ...)
|
kvmalloc(C1 * C2, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# b0d17578 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Enable COMPAT_32BIT_TIME

Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop)
take a timespec argument. After we move 32-bit architectures over to
useing 64-bit time_t based syscalls, we need seperate entry points for
the old 32-bit based interfaces.

This changes the #ifdef guards for the existing 32-bit compat syscalls
to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
enabled on all existing 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 21fc538d 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Use __kernel_timespec

This is a preparatation for changing over __kernel_timespec to 64-bit
times, which involves assigning new system call numbers for mq_timedsend(),
mq_timedreceive() and semtimedop() for compatibility with future y2038
proof user space.

The existing ABIs will remain available through compat code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# c2ab975c 28-Apr-2015 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Report long times to user space

The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended
to contain extra fields for storing the upper bits of the time stamps,
this patch does the other half of the job and and fills the new fields on
32-bit architectures as well as 32-bit tasks running on a 64-bit kernel
in compat mode.

There should be no change for native 64-bit tasks.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 2a70b787 12-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Use ktime_get_real_seconds consistently

In some places, we still used get_seconds() instead of
ktime_get_real_seconds(), and I'm changing the remaining ones now to
all use ktime_get_real_seconds() so we use the full available range for
timestamps instead of overflowing the 'unsigned long' return value in
year 2106 on 32-bit kernels.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# a280d6dc 10-Apr-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: introduce semctl(SEM_STAT_ANY)

There is a permission discrepancy when consulting shm ipc object
metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
command. The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the sma metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it. Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs). Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc
object permissions are ignored, and only audited instead. In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d969c6fa 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semctl syscall/compat_syscall wrappers

Provide ksys_semctl() and compat_ksys_semctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_semctl() and compat_sys_semctl().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 69894718 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semget syscall wrapper

Provide ksys_semget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_semget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 41f4f0e2 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semtimedop syscall/compat_syscall wrappers

Provide ksys_semtimedop() and compat_ksys_semtimedop() wrappers to avoid
in-kernel calls to these syscalls. The ksys_ prefix denotes that these
functions are meant as a drop-in replacement for the syscalls. In
particular, they use the same calling convention as sys_semtimedop() and
compat_sys_semtimedop().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 50ab44b1 23-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc: Directly call the security hook in ipc_ops.associate

After the last round of cleanups the shm, sem, and msg associate
operations just became trivial wrappers around the appropriate security
method. Simplify things further by just calling the security method
directly.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 51d6f263 23-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces

Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process. If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.

Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong. Using struct
pids from separate process are not a problem because they do not share
a cache line. Using struct pid from different threads of the same
process are unlikely to be a problem as the reference count update
can be avoided.

Further linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores. So I expect
programs that are performance limited by their interprocess mutual
exclusion primitive will be using futexes.

So while it is possible that enhancing the storage of the last
rocess of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count. I don't expect
that to happen in practice.

This change updates semctl(..., GETPID, ...) to return the
process id of the last process to update a semphore inthe
pid namespace of the calling process.

Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 1a5c1349 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

sem: Move struct sem and struct sem_array into ipc/sem.c

All of the users are now in ipc/sem.c so make the definitions
local to that file to make code maintenance easier. AKA
to prevent rebuilding the entire kernel when one of these
files is changed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# aefad959 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks

All of the implementations of security hooks that take sem_array only
access sem_perm the struct kern_ipc_perm member. This means the
dependencies of the sem security hooks can be simplified by passing
the kern_ipc_perm member of sem_array.

Making this change will allow struct sem and struct sem_array
to become private to ipc/sem.c.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 87ad4b0d 06-Feb-2018 Philippe Mikoyan <philippe.mikoyan@skat.systems>

ipc: fix ipc data structures inconsistency

As described in the title, this patch fixes <ipc>id_ds inconsistency when
<ipc>ctl_stat executes concurrently with some ds-changing function, e.g.
shmat, msgsnd or whatever.

For instance, if shmctl(IPC_STAT) is running concurrently
with shmat, following data structure can be returned:
{... shm_lpid = 0, shm_nattch = 1, ...}

Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems
Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 39c96a1b 17-Nov-2017 Davidlohr Bueso <dave@stgolabs.net>

sysvipc: duplicate lock comments wrt ipc_addid()

The comment in msgqueues when using ipc_addid() is quite useful imo.
Duplicate it for shm and semaphores.

Link: http://lkml.kernel.org/r/20170831172049.14576-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6aa211e8 25-Sep-2017 Linus Torvalds <torvalds@linux-foundation.org>

fix address space warnings in ipc/

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0cfb6aee 08-Sep-2017 Guillaume Knispel <guillaume.knispel@supersonicimagine.com>

ipc: optimize semget/shmget/msgget for lots of keys

ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).

This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:

for (int i = 0, k=0x424242; i < KEYS; ++i)
semget(k++, 1, IPC_CREAT | 0600);

total total max single max single
KEYS without with call without call with

1 3.5 4.9 µs 3.5 4.9
10 7.6 8.6 µs 3.7 4.7
32 16.2 15.9 µs 4.3 5.3
100 72.9 41.8 µs 3.7 4.7
1000 5,630.0 502.0 µs * *
10000 1,340,000.0 7,240.0 µs * *
31900 17,600,000.0 22,200.0 µs * *

*: unreliable measure: high variance

The duration for a lookup-only usage was obtained by the same loop once
the keys are present:

total total max single max single
KEYS without with call without call with

1 2.1 2.5 µs 2.1 2.5
10 4.5 4.8 µs 2.2 2.3
32 13.0 10.8 µs 2.3 2.8
100 82.9 25.1 µs * 2.3
1000 5,780.0 217.0 µs * *
10000 1,470,000.0 2,520.0 µs * *
31900 17,400,000.0 7,810.0 µs * *

Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:

creation:
total total
KEYS without with

1 3.7 5.0 µs
10 32.9 36.7 µs
32 125.0 109.0 µs
100 523.0 353.0 µs
1000 20,300.0 3,280.0 µs
10000 2,470,000.0 46,700.0 µs
31900 27,800,000.0 219,000.0 µs

lookup-only:
total total
KEYS without with

1 2.5 2.7 µs
10 25.4 24.4 µs
32 106.0 72.6 µs
100 591.0 352.0 µs
1000 22,400.0 2,250.0 µs
10000 2,510,000.0 25,700.0 µs
31900 28,200,000.0 115,000.0 µs

[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4243b80 08-Sep-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: play nicer with large nsops allocations

Replacing semop()'s kmalloc for kvmalloc was originally proposed by
Manfred on the premise that it can be called for large (than order-1)
sizes. For example, while Oracle recommends setting SEMOPM to a _minimum_
of 100, some distros[1] encourage the setting to be a factor of the amount
of db tasks (PROCESSES), which can get fishy for large systems (easily
going beyond 1000).

[1] An Example of Semaphore Settings
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html

So let's just convert this to kvmalloc, just like the rest of the
allocations we do in ipc. While the fallback vmalloc obviously involves
more overhead, this by far the uncommon path, and it's better for the user
than just erroring out with kmalloc.

Link: http://lkml.kernel.org/r/20170803184136.13855-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8419e64a 08-Sep-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: drop sem_checkid helper

... 'tis not used.

Link: http://lkml.kernel.org/r/20170803184136.13855-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f74370b8 08-Sep-2017 Elena Reshetova <elena.reshetova@intel.com>

ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t

refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/1499417992-3238-3-git-send-email-elena.reshetova@intel.com
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <arozansk@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e54d02b2 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: sem: Make sem_array timestamps y2038 safe

time_t is not y2038 safe. Replace all uses of
time_t by y2038 safe time64_t.

Similarly, replace the calls to get_seconds() with
y2038 safe ktime_get_real_seconds().
Note that this preserves fast access on 64 bit systems,
but 32 bit systems need sequence counters.

The syscall interface themselves are not changed as part of
the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3ef56dc2 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: Make sys_semtimedop() y2038 safe

struct timespec is not y2038 safe on 32 bit machines.
Replace timespec with y2038 safe struct timespec64.

Note that the patch only changes the internals without
modifying the syscall interface. This will be part
of a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e0892e08 29-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

ipc: Replace spin_unlock_wait() with lock/unlock pair

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
exit_sem() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because exit_sem()
is rarely invoked in production.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Manfred Spraul <manfred@colorfullife.com>


# ade9f91b 02-Aug-2017 Kees Cook <keescook@chromium.org>

ipc: add missing container_of()s for randstruct

When building with the randstruct gcc plugin, the layout of the IPC
structs will be randomized, which requires any sub-structure accesses to
use container_of(). The proc display handlers were missing the needed
container_of()s since the iterator is passing in the top-level struct
kern_ipc_perm.

This would lead to crashes when running the "lsipc" program after the
system had IPC registered (e.g. after starting up Gnome):

general protection fault: 0000 [#1] PREEMPT SMP
...
RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0
...
Call Trace:
sysvipc_shm_proc_show+0x5e/0x150
sysvipc_proc_show+0x1a/0x30
seq_read+0x2e9/0x3f0
...

Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast
Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 44ee4546 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semtimedop(): move compat to native

... and finally kill the sodding compat_convert_timespec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c0ebccb6 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semctl(): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 45a4a64a 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semctl(): separate all layout-dependent copyin/copyout

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e2029dfe 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: drop __sem_free()

The remaining users of __sem_free() can simply call kvfree() instead for
better readability.

[manfred@colorfullife.com: Rediff to keep rcu protection for security_sem_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-20-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d3653f9 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc: move atomic_set() to where it is needed

Only after ipc_addid() has succeeded will refcounting be used, so move
initialization into ipc_addid() and remove from open-coded *_alloc()
routines.

Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2ec55f80 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: avoid ipc_rcu_putref for failed ipc_addid()

Loosely based on a patch from Kees Cook <keescook@chromium.org>:
- id and retval can be merged
- if ipc_addid() fails, then use call_rcu() directly.

The difference is that call_rcu is used for failed ipc_addid() calls, to
continue to guaranteed an rcu delay for security_sem_free().

Link: http://lkml.kernel.org/r/20170525185107.12869-14-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 101ede01 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: avoid ipc_rcu_alloc()

Instead of using ipc_rcu_alloc() which only performs the refcount bump,
open code it to perform better sem-specific checks. This also allows
for sem_array structure layout to be randomized in the future.

[manfred@colorfullife.com: Rediff, because the memset was temporarily inside ipc_rcu_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-10-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b4654ef 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: do not use ipc_rcu_free()

Avoid using ipc_rcu_free, since it just re-finds the original structure
pointer. For the pre-list-init failure path, there is no RCU needed,
since it was just allocated. It can be directly freed.

Link: http://lkml.kernel.org/r/20170525185107.12869-6-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f8dbe8d2 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc: drop non-RCU allocation

The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap
sem_io fall-back memory. Better to just open-code these to make things
easier to read.

[manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dba4cdd3 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc: merge ipc_rcu and kern_ipc_perm

ipc has two management structures that exist for every id:
- struct kern_ipc_perm, it contains e.g. the permissions.
- struct ipc_rcu, it contains the rcu head for rcu handling and the
refcount.

The patch merges both structures.

As a bonus, we may save one cacheline, because both structures are
cacheline aligned. In addition, it reduces the number of casts, instead
most codepaths can use container_of.

To simplify code, the ipc_rcu_alloc initializes the allocation to 0.

[manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a233956 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove sem_base, embed struct sem

sma->sem_base is initialized with

sma->sem_base = (struct sem *) &sma[1];

The current code has four problems:
- There is an unnecessary pointer dereference - sem_base is not needed.
- Alignment for struct sem only works by chance.
- The current code causes false positive for static code analysis.
- This is a cast between different non-void types, which the future
randstruct GCC plugin warns on.

And, as bonus, the code size gets smaller:

Before:
0 .text 00003770
After:
0 .text 0000374e

[manfred@colorfullife.com: s/[0]/[]/, per hch]
Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com
Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: <1vier1@web.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 84f001e1 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>

We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9de5ab8a 27-Feb-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem: add hysteresis

sysv sem has two lock modes: One with per-semaphore locks, one lock mode
with a single global lock for the whole array. When switching from the
per-semaphore locks to the global lock, all per-semaphore locks must be
scanned for ongoing operations.

The patch adds a hysteresis for switching from the global lock to the
per semaphore locks. This reduces how often the per-semaphore locks
must be scanned.

Compared to the initial patch, this is a simplified solution: Setting
USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

In theory, a workload with exactly 10 simple sops and then one complex
op now scales a bit worse, but this is pure theory: If there is
concurrency, the it won't be exactly 10:1:10:1:10:1:... If there is no
concurrency, then there is no need for scalability.

Link: http://lkml.kernel.org/r/1476851896-3590-3-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: <1vier1@web.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: <felixh@informatik.uni-bremen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27d7be18 27-Feb-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: avoid using spin_unlock_wait()

a) The ACQUIRE in spin_lock() applies to the read, not to the store, at
least for powerpc. This forces to add a smp_mb() into the fast path.

b) The memory barrier provided by spin_unlock_wait() is right now arch
dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single op semop calls(), observed +8.9% on x86. (the
other solution would be arch dependencies in ipc/sem).

Disadvantage: slower complex op semop calls, if (and only if) there are
no sleeping operations.

The next patch adds hysteresis, this further reduces the probability
that the slow path is used.

Link: http://lkml.kernel.org/r/1476851896-3590-2-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: <1vier1@web.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: <felixh@informatik.uni-bremen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c626bc46 10-Jan-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix incorrect sem_lock pairing

Based on the syzcaller test case from dvyukov:

https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt

The slow (i.e.: failure to acquire) syscall exit from semtimedop()
incorrectly assumed that the the same lock is acquired as it was at the
initial syscall entry.

This is wrong:
- thread A: single semop semop(), sleeps
- thread B: multi semop semop(), sleeps
- thread A: woken up by signal/timeout

With this sequence, the initial sem_lock() call locks the per-semaphore
spinlock, and it is unlocked with sem_unlock(). The call at the syscall
return locks the global spinlock. Because locknum is not updated, the
following sem_unlock() call unlocks the per-semaphore spinlock, which is
actually not locked.

The fix is trivial: Use the return value from sem_lock.

Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Johanna Abrahamsson <johanna@mjao.org>
Tested-by: Johanna Abrahamsson <johanna@mjao.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 370b262c 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: avoid idr tree lookup for interrupted semop

We can avoid the idr tree lookup (albeit possibly avoiding
idr_find_fast()) when being awoken in EINTR, as the semid will not
change in this context while blocked. Use the sma pointer directly and
take the sem_lock, then re-check for RMID races. We continue to
re-check the queue.status with the lock held such that we can detect
situations where we where are dealing with a spurious wakeup but another
task that holds the sem_lock updated the queue.status while we were
spinning for it. Once we take the lock it obviously won't change again.

Being the only caller, get rid of sem_obtain_lock() altogether.

Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b5fa01a2 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: simplify wait-wake loop

Instead of using the reverse goto, we can simplify the flow and make it
more language natural by just doing do-while instead. One would hope
this is the standard way (or obviously just with a while bucle) that we
do wait/wakeup handling in the kernel. The exact same logic is kept,
just more indented.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f150f02c 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: use proper list api for pending_list wakeups

... saves some LoC and looks cleaner than re-implementing the calls.

Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4663d3e8 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: explicitly inline check_restart

The compiler already does this, but make it explicit. This helper is
really small and also used in update_queue's main loop, which is O(N^2)
scanning. Inline and avoid the function overhead.

Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ce33ec2 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: optimize perform_atomic_semop()

This is the main workhorse that deals with semop user calls such that
the waitforzero or semval update operations, on the set, can complete on
not as the sma currently stands. Currently, the set is iterated twice
(setting semval, then backwards for the sempid value). Slowpaths, and
particularly SEM_UNDO calls, must undo any altered sem when it is
detected that the caller must block or has errored-out.

With larger sets, there can occur situations where this involves a lot
of cycles and can obviously be a suboptimal use of cached resources in
shared memory. Ie, discarding CPU caches that are also calling semop
and have the sembuf cached (and can complete), while the current lock
holder doing the semop will block, error, or does a waitforzero
operation.

This patch proposes still iterating the set twice, but the first scan is
read-only, and we perform the actual updates afterward, once we know
that the call will succeed. In order to not suffer from the overhead of
dealing with sops that act on the same sem_num, such (rare) cases use
perform_atomic_semop_slow(), which is exactly what we have now.
Duplicates are detected before grabbing sem_lock, and uses simple a
32/64-bit hash array variable to based on the sem_num we are working on.

In addition add some comments to when we expect to the caller to block.

[akpm@linux-foundation.org: coding-style fixes]
[colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer]
Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com
Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ae949fa 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: rework task wakeups

Our sysv sems have been using the notion of lockless wakeups for a
while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process
out of the spinlock section"), in order to reduce the sem_lock hold
times. This in-house pending queue can be replaced by wake_q (just like
all the rest of ipc now), in that it provides the following advantages:

o Simplifies and gets rid of unnecessary code.

o We get rid of the IN_WAKEUP complexities. Given that wake_q_add()
grabs reference to the task, if awoken due to an unrelated event,
between the wake_q_add() and wake_up_q() window, we cannot race with
sys_exit and the imminent call to wake_up_process().

o By not spinning IN_WAKEUP, we no longer need to disable preemption.

In consequence, the wakeup paths (after schedule(), that is) must
acknowledge an external signal/event, as well spurious wakeup occurring
during the pending wakeup window. Obviously no changes in semantics
that could be visible to the user. The fastpath is _only_ for when we
know for sure that we were awoken due to a the waker's successful semop
call (queue.status is not -EINTR).

On a 48-core Haswell, running the ipcscale 'waitforzero' test, the
following is seen with increasing thread counts:

v4.8-rc5 v4.8-rc5
semopv2
Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%)
Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%)
Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%)
Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%)
Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%)
Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%)
Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%)
Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%)
Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%)
Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%)
Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%)
Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%)
Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%)
Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%)
Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%)

[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename]
Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au
Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 248e7357 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: do not call wake_sem_queue_do() prematurely ... as this call should obviously be paired with its _prepare()

counterpart. At least whenever possible, as there is no harm in calling
it bogusly as we do now in a few places. Immediate error semop(2) paths
that are far from ever having the task block can be simplified and avoid
a few unnecessary loads on their way out of the call as it is not deeply
nested.

Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a1613a5 11-Oct-2016 Nikolay Borisov <kernel@kyup.com>

ipc/sem.c: add cond_resched in exit_sme

In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in
exit_sem. Apparently it's possible for the loop to take quite a long time
and it doesn't have a scheduling point in it. Since the codes is
executing under an rcu read section this may also cause rcu stalls, which
in turn block synchronize_rcu operations, which more or less de-stabilises
the whole system.

Fix this by introducing a cond_resched() at the beginning of the loop.

So this patch fixes the following:

NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
RIP: 0010:[<ffffffff81614bc7>] [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30
RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
Call Trace:
? exit_sem+0x7c/0x280
do_exit+0x338/0xb40
do_group_exit+0x43/0xd0
SyS_exit_group+0x14/0x20
entry_SYSCALL_64_fastpath+0x16/0x6e

Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5864a2fd 11-Oct-2016 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix complex_count vs. simple op race

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
Reported-by: <felixh@informatik.uni-bremen.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9b24fef9 02-Aug-2016 Fabian Frederick <fabf@skynet.be>

sysv, ipc: fix security-layer leaking

Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.

Running LTP msgsnd06 with kmemleak gives the following:

cat /sys/kernel/debug/kmemleak

unreferenced object 0xffff88003c0a11f8 (size 8):
comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
hex dump (first 8 bytes):
1b 00 00 00 01 00 00 00 ........
backtrace:
kmemleak_alloc+0x23/0x40
kmem_cache_alloc_trace+0xe1/0x180
selinux_msg_queue_alloc_security+0x3f/0xd0
security_msg_queue_alloc+0x2e/0x40
newque+0x4e/0x150
ipcget+0x159/0x1b0
SyS_msgget+0x39/0x40
entry_SYSCALL_64_fastpath+0x13/0x8f

Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to
only use ipc_rcu_free in case of security allocation failure in newary()

Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be3e7844 24-May-2016 Peter Zijlstra <peterz@infradead.org>

locking/spinlock: Update spin_unlock_wait() users

With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() usecase, as that was somewhat stale/obscure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 33ac2796 24-May-2016 Peter Zijlstra <peterz@infradead.org>

locking/barriers: Introduce smp_acquire__after_ctrl_dep()

Introduce smp_acquire__after_ctrl_dep(), this construct is not
uncommon, but the lack of this barrier is.

Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
semaphore code and the qspinlock code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a5f4db87 22-Mar-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: make semctl setting sempid consistent

As indicated by bug#112271, Linux sets the sempid value upon semctl, and
not only for semop calls. However, within semctl we only do this for
SETVAL, leaving SETALL without updating the field, and therefore rather
inconsistent behavior when compared to other Unices.

There is really no documentation regarding this and therefore users
should not make assumptions. With this patch, along with updating
semctl.2 manpages, this scenario should become less ambiguous As such,
set sempid on SETALL cmd.

Also update some in-code documentation, specifying where the sempid is
set.

Passes ltp and custom testcase where a child (fork) does SETALL to the
set.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d5cfdb0 22-Jan-2016 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

tree wide: use kvfree() than conditional kfree()/vfree()

There are many locations that do

if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);

but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ed1f8a9 14-Aug-2015 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update/correct memory barriers

sem_lock() did not properly pair memory barriers:

!spin_is_locked() and spin_unlock_wait() are both only control barriers.
The code needs an acquire barrier, otherwise the cpu might perform read
operations before the lock test.

As no primitive exists inside <include/spinlock.h> and since it seems
noone wants another primitive, the code creates a local primitive within
ipc/sem.c.

With regards to -stable:

The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
nop (just a preprocessor redefinition to improve the readability). The
bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
starting from 3.10).

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9795584 14-Aug-2015 Herton R. Krzesinski <herton@redhat.com>

ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()

After we acquire the sma->sem_perm lock in exit_sem(), we are protected
against a racing IPC_RMID operation. Also at that point, we are the last
user of sem_undo_list. Therefore it isn't required that we acquire or use
ulp->lock.

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
CC: Aristeu Rozanski <aris@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 602b8593 14-Aug-2015 Herton R. Krzesinski <herton@redhat.com>

ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

The current semaphore code allows a potential use after free: in
exit_sem we may free the task's sem_undo_list while there is still
another task looping through the same semaphore set and cleaning the
sem_undo list at freeary function (the task called IPC_RMID for the same
semaphore set).

For example, with a test program [1] running which keeps forking a lot
of processes (which then do a semop call with SEM_UNDO flag), and with
the parent right after removing the semaphore set with IPC_RMID, and a
kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
in the kernel log:

Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
Prev obj: start=ffff88003b45c180, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
Next obj: start=ffff88003b45c200, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<....
BUG: spinlock wrong CPU on CPU#2, test/18028
general protection fault: 0000 [#1] SMP
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: spin_dump+0x53/0xc0
Call Trace:
spin_bug+0x30/0x40
do_raw_spin_unlock+0x71/0xa0
_raw_spin_unlock+0xe/0x10
freeary+0x82/0x2a0
? _raw_spin_lock+0xe/0x10
semctl_down.clone.0+0xce/0x160
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
SyS_semctl+0x236/0x2c0
? syscall_trace_leave+0xde/0x130
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0
RSP <ffff88003750fd68>
---[ end trace 783ebb76612867a0 ]---
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: native_read_tsc+0x0/0x20
Call Trace:
? delay_tsc+0x40/0x70
__delay+0xf/0x20
do_raw_spin_lock+0x96/0x140
_raw_spin_lock+0xe/0x10
sem_lock_and_putref+0x11/0x70
SYSC_semtimedop+0x7bf/0x960
? handle_mm_fault+0xbf6/0x1880
? dequeue_task_fair+0x79/0x4a0
? __do_page_fault+0x19a/0x430
? kfree_debugcheck+0x16/0x40
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
? do_audit_syscall_entry+0x66/0x70
? syscall_trace_enter_phase1+0x139/0x160
SyS_semtimedop+0xe/0x10
SyS_semop+0x10/0x20
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
Kernel panic - not syncing: softlockup: hung tasks

I wasn't able to trigger any badness on a recent kernel without the
proper config debugs enabled, however I have softlockup reports on some
kernel versions, in the semaphore code, which are similar as above (the
scenario is seen on some servers running IBM DB2 which uses semaphore
syscalls).

The patch here fixes the race against freeary, by acquiring or waiting
on the sem_undo_list lock as necessary (exit_sem can race with freeary,
while freeary sets un->semid to -1 and removes the same sem_undo from
list_proc or when it removes the last sem_undo).

After the patch I'm unable to reproduce the problem using the test case
[1].

[1] Test case used below:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

#define NSEM 1
#define NSET 5

int sid[NSET];

void thread()
{
struct sembuf op;
int s;
uid_t pid = getuid();

s = rand() % NSET;
op.sem_num = pid % NSEM;
op.sem_op = 1;
op.sem_flg = SEM_UNDO;

semop(sid[s], &op, 1);
exit(EXIT_SUCCESS);
}

void create_set()
{
int i, j;
pid_t p;
union {
int val;
struct semid_ds *buf;
unsigned short int *array;
struct seminfo *__buf;
} un;

/* Create and initialize semaphore set */
for (i = 0; i < NSET; i++) {
sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
if (sid[i] < 0) {
perror("semget");
exit(EXIT_FAILURE);
}
}
un.val = 0;
for (i = 0; i < NSET; i++) {
for (j = 0; j < NSEM; j++) {
if (semctl(sid[i], j, SETVAL, un) < 0)
perror("semctl");
}
}

/* Launch threads that operate on semaphore set */
for (i = 0; i < NSEM * NSET * NSET; i++) {
p = fork();
if (p < 0)
perror("fork");
if (p == 0)
thread();
}

/* Free semaphore set */
for (i = 0; i < NSET; i++) {
if (semctl(sid[i], NSEM, IPC_RMID))
perror("IPC_RMID");
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}

int main(int argc, char **argv)
{
pid_t p;

srand(time(NULL));

while (1) {
p = fork();
if (p < 0) {
perror("fork");
exit(EXIT_FAILURE);
}
if (p == 0) {
create_set();
goto end;
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}
end:
return 0;
}

[akpm@linux-foundation.org: use normal comment layout]
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
CC: Aristeu Rozanski <aris@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55b7ae50 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc: rename ipc_obtain_object

... to ipc_obtain_object_idr, which is more meaningful and makes the code
slightly easier to follow.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7f032d6e 15-Apr-2015 Joe Perches <joe@perches.com>

ipc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52644c9a 17-Feb-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,sem: use current->state helpers

Call __set_current_state() instead of assigning the new state directly.
These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping
track of who changed the state.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2e094abf 12-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()

When I fixed bugs in the sem_lock() logic, I was more conservative than
necessary. Therefore it is safe to replace the smp_mb() with smp_rmb().
And: With smp_rmb(), semop() syscalls are up to 10% faster.

The race we must protect against is:

sem->lock is free
sma->complex_count = 0
sma->sem_perm.lock held by thread B

thread A:

A: spin_lock(&sem->lock)

B: sma->complex_count++; (now 1)
B: spin_unlock(&sma->sem_perm.lock);

A: spin_is_locked(&sma->sem_perm.lock);
A: XXXXX memory barrier
A: if (sma->complex_count == 0)

Thread A must read the increased complex_count value, i.e. the read must
not be reordered with the read of sem_perm.lock done by spin_is_locked().

Since it's about ordering of reads, smp_rmb() is sufficient.

[akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8577d1f 02-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fully initialize sem_array before making it visible

ipc_addid() makes a new ipc identifier visible to everyone. New objects
start as locked, so that the caller can complete the initialization
after the call. Within struct sem_array, at least sma->sem_base and
sma->sem_nsems are accessed without any locks, therefore this approach
doesn't work.

Thus: Move the ipc_addid() to the end of the initialization.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Rik van Riel <riel@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9b44ee2e 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)

The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
always (since 0.99.10) reported a thread as sleeping on all semaphores
that are listed in the semop() call.

The documented behavior (both in the Linux man page and in the Single
Unix Specification) is that a task should be reported on exactly one
semaphore: The semaphore that caused the thread to got to sleep.

This patch adds a pr_info_once() that is triggered if a thread hits the
relevant case.

The code triggers slightly too often, otherwise it would be necessary to
replicate the old code. As there are no known users of GETNCNT or
GETZCNT, this is done to prevent unnecessary bloat.

The task that triggered is reported with name (tsk->comm) and pid.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b220c57a 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliant

SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
waits on exactly one semaphore: The semaphore from the first operation
in the sop array that cannot proceed.

The Linux implementation never followed the standard, it tried to count
all semaphores that might be the reason why a task sleeps.

This patch fixes that.

Note:
a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
therefore the code counts them only on demand.
(If they wouldn't be rare, then the non-compliance would have
been found earlier)

b) compared to the initial version of the patch, the BUG_ONs were removed
and it was clarified that the new behavior conforms to SUS.

Back-compatibility concerns:

Manfred:

: - there is no application in Fedora that uses GETNCNT or GETZCNT.
:
: - application that use only single-sop semop() are also safe, the
: difference only affects complex apps.
:
: - portable application are also safe, the new behavior is standard
: compliant.
:
: But that's it. The old behavior existed in Linux from 0.99.something
: until now.

Michael:

: * These operations seem to be very little used. Grepping the public
: source that is contained Fedora 20 source DVD, there appear to be no
: uses. Of course, this says nothing about uses in private /
: non-mainstream FOSS code, but it seems likely that the same pattern
: is followed there.
:
: * The existing behavior is hard enough to understand that I suspect
: that no one understood it well enough to rely on it anyway
: (especially as that behavior contradicted both man page and POSIX).
:
: So, there's a chance of breakage, but I estimate that it's minute.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed247b7c 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: store which operation blocks in perform_atomic_semop()

Preparation for the next patch:

In the slow-path of perform_atomic_semop(), store a pointer to the
operation that caused the operation to block.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d198cd6d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: change perform_atomic_semop parameters

Right now, perform_atomic_semop gets the content of sem_queue as
individual fields. Changes that, instead pass a pointer to sem_queue.

This is a preparation for the next patch: it uses sem_queue to store the
reason why a task must sleep.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f2ed41d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove code duplication

count_semzcnt and count_semncnt are more of less identical. The patch
creates a single function that either counts the number of tasks waiting
for zero or waiting due to a decrease operation.

Compared to the initial version, the BUG_ONs were removed.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1994862d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: bugfix for semctl(,,GETZCNT)

GETZCNT is supposed to return the number of threads that wait until a
semaphore value becomes 0.

The current implementation overlooks complex operations that contain
both wait-for-zero operation and operations that alter at least one
semaphore.

The patch fixes that. It's intentionally copy&paste, this will be
cleaned up in the next patch.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46c0a8ca 06-Jun-2014 Paul McQuade <paulmcquad@gmail.com>

ipc, kernel: clear whitespace

trailing whitespace

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7153e402 06-Jun-2014 Paul McQuade <paulmcquad@gmail.com>

ipc, kernel: use Linux headers

Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb66ec44 06-Jun-2014 Mathias Krause <minipli@googlemail.com>

ipc: constify ipc_ops

There is no need to recreate the very same ipc_ops structure on every
kernel entry for msgget/semget/shmget. Just declare it static and be
done with it. While at it, constify it as we don't modify the structure
at runtime.

Found in the PaX patch, written by the PaX Team.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ab08fe2 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: remove braces for single statements

Deal with checkpatch messages:
WARNING: braces {} are not necessary for single statement blocks

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8001c858 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: standardize code comments

IPC commenting style is all over the place, *specially* in util.c. This
patch orders things a bit.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 239521f3 27-Jan-2014 Manfred Spraul <manfred@colorfullife.com>

ipc: whitespace cleanup

The ipc code does not adhere the typical linux coding style.
This patch fixes lots of simple whitespace errors.

- mostly autogenerated by
scripts/checkpatch.pl -f --fix \
--types=pointer_location,spacing,space_before_tab
- one manual fixup (keep structure members tab-aligned)
- removal of additional space_before_tab that were not found by --fix

Tested with some of my msg and sem test apps.

Andrew: Could you include it in -mm and move it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Li Bin <huawei.libin@huawei.com>
Cc: Joe Perches <joe@perches.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72a8ff2f 27-Jan-2014 Rafael Aquini <aquini@redhat.com>

ipc: change kern_ipc_perm.deleted type to bool

struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and
the changes introduced by this patch are just to make the case explicit.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f3d2b01 27-Jan-2014 Rafael Aquini <aquini@redhat.com>

ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races

After the locking semantics for the SysV IPC API got improved, a couple
of IPC_RMID race windows were opened because we ended up dropping the
'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The
spotted races got sorted out by re-introducing the old test within the
racy critical sections.

This patch introduces ipc_valid_object() to consolidate the way we cope
with IPC_RMID races by using the same abstraction across the API
implementation.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78f5009c 27-Jan-2014 Petr Mladek <pmladek@suse.cz>

ipc/sem.c: avoid overflow of semop undo (semadj) value

When trying to understand semop code, I found a small mistake in the check
for semadj (undo) value overflow. The new undo value is not stored
immediately and next potential checks are done against the old value.

The failing scenario is not much practical. One semop call has to do more
operations on the same semaphore. Also semval and semadj must have
different values, so there has to be some operations without SEM_UNDO
flag. For example:

struct sembuf depositor_op[1];
struct sembuf collector_op[2];

depositor_op[0].sem_num = 0;
depositor_op[0].sem_op = 20000;
depositor_op[0].sem_flg = 0;

collector_op[0].sem_num = 0;
collector_op[0].sem_op = -10000;
collector_op[0].sem_flg = SEM_UNDO;
collector_op[1].sem_num = 0;
collector_op[1].sem_op = -10000;
collector_op[1].sem_flg = SEM_UNDO;

if (semop(semid, depositor_op, 1) == -1)
{ perror("Failed to do 1st deposit"); return 1; }

if (semop(semid, collector_op, 2) == -1)
{ perror("Failed to do 1st collect"); return 1; }

if (semop(semid, depositor_op, 1) == -1)
{ perror("Failed to do 2nd deposit"); return 1; }

if (semop(semid, collector_op, 2) == -1)
{ perror("Failed to do 2nd collect"); return 1; }

return 0;

It passes without error now but the semadj value has overflown in the 2nd
collector operation.

[akpm@linux-foundation.org: restore lessened scope of local `undo']
[davidlohr@hp.com: correct header comment for perform_atomic_semop]
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6e224f94 16-Oct-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: synchronize semop and semctl with IPC_RMID

After acquiring the semlock spinlock, operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but
should be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
access memory after free.

The patch also:
- standardizes the tests for .deleted, so that all tests in one
function leave the function with the same approach.
- unconditionally tests for .deleted immediately after every call to
sem_lock - even it it means that for semctl(GETALL), .deleted will be
tested twice.

Both changes make the review simpler: After every sem_lock, there must
be a test of .deleted, followed by a goto to the cleanup code (if the
function uses "goto cleanup").

The only exception is semctl_down(): If sem_ids().rwsem is locked, then
the presence in ids->ipcs_idr is equivalent to !.deleted, thus no
additional test is required.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0e8c6656 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update sem_otime for all operations

In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
spinlock section"), the update of semaphore's sem_otime(last semop time)
was moved to one central position (do_smart_update).

But since do_smart_update() is only called for operations that modify
the array, this means that wait-for-zero semops do not update sem_otime
anymore.

The fix is simple:
Non-alter operations must update sem_otime.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Jia He <jiakernel@gmail.com>
Tested-by: Jia He <jiakernel@gmail.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8c63376 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: synchronize the proc interface

The proc interface is not aware of sem_lock(), it instead calls
ipc_lock_object() directly. This means that simple semop() operations
can run in parallel with the proc interface. Right now, this is
uncritical, because the implementation doesn't do anything that requires
a proper synchronization.

But it is dangerous and therefore should be fixed.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d07b68c 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize sem_lock()

Operations that need access to the whole array must guarantee that there
are no simple operations ongoing. Right now this is achieved by
spin_unlock_wait(sem->lock) on all semaphores.

If complex_count is nonzero, then this spin_unlock_wait() is not
necessary, because it was already performed in the past by the thread
that increased complex_count and even though sem_perm.lock was dropped
inbetween, no simple operation could have started, because simple
operations cannot start when complex_count is non-zero.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5e9d5275 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix race in sem_lock()

The exclusion of complex operations in sem_lock() is insufficient: after
acquiring the per-semaphore lock, a simple op must first check that
sem_perm.lock is not locked and only after that test check
complex_count. The current code does it the other way around - and that
creates a race. Details are below.

The patch is a complete rewrite of sem_lock(), based in part on the code
from Mike Galbraith. It removes all gotos and all loops and thus the
risk of livelocks.

I have tested the patch (together with the next one) on my i3 laptop and
it didn't cause any problems.

The bug is probably also present in 3.10 and 3.11, but for these kernels
it might be simpler just to move the test of sma->complex_count after
the spin_is_locked() test.

Details of the bug:

Assume:
- sma->complex_count = 0.
- Thread 1: semtimedop(complex op that must sleep)
- Thread 2: semtimedop(simple op).

Pseudo-Trace:

Thread 1: sem_lock(): acquire sem_perm.lock
Thread 1: sem_lock(): check for ongoing simple ops
Nothing ongoing, thread 2 is still before sem_lock().
Thread 1: try_atomic_semop()
<<< preempted.

Thread 2: sem_lock():
static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
int nsops)
{
int locknum;
again:
if (nsops == 1 && !sma->complex_count) {
struct sem *sem = sma->sem_base + sops->sem_num;

/* Lock just the semaphore we are interested in. */
spin_lock(&sem->lock);

/*
* If sma->complex_count was set while we were spinning,
* we may need to look at things we did not lock here.
*/
if (unlikely(sma->complex_count)) {
spin_unlock(&sem->lock);
goto lock_array;
}
<<<<<<<<<
<<< complex_count is still 0.
<<<
<<< Here it is preempted
<<<<<<<<<

Thread 1: try_atomic_semop() returns, notices that it must sleep.
Thread 1: increases sma->complex_count.
Thread 1: drops sem_perm.lock
Thread 2:
/*
* Another process is holding the global lock on the
* sem_array; we cannot enter our critical section,
* but have to wait for the global lock to be released.
*/
if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
spin_unlock(&sem->lock);
spin_unlock_wait(&sma->sem_perm.lock);
goto again;
}
<<< sem_perm.lock already dropped, thus no "goto again;"

locknum = sops->sem_num;

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 53dad6d3 23-Sep-2013 Davidlohr Bueso <davidlohr@hp.com>

ipc: fix race with LSMs

Currently, IPC mechanisms do security and auditing related checks under
RCU. However, since security modules can free the security structure,
for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
race if the structure is freed before other tasks are done with it,
creating a use-after-free condition. Manfred illustrates this nicely,
for instance with shared mem and selinux:

-> do_shmat calls rcu_read_lock()
-> do_shmat calls shm_object_check().
Checks that the object is still valid - but doesn't acquire any locks.
Then it returns.
-> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
-> selinux_shm_shmat calls ipc_has_perm()
-> ipc_has_perm accesses ipc_perms->security

shm_close()
-> shm_close acquires rw_mutex & shm_lock
-> shm_close calls shm_destroy
-> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
-> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
-> ipc_free_security calls kfree(ipc_perms->security)

This patch delays the freeing of the security structures after all RCU
readers are done. Furthermore it aligns the security life cycle with
that of the rest of IPC - freeing them based on the reference counter.
For situations where we need not free security, the current behavior is
kept. Linus states:

"... the old behavior was suspect for another reason too: having the
security blob go away from under a user sounds like it could cause
various other problems anyway, so I think the old code was at least
_prone_ to bugs even if it didn't have catastrophic behavior."

I have tested this patch with IPC testcases from LTP on both my
quad-core laptop and on a 64 core NUMA server. In both cases selinux is
enabled, and tests pass for both voluntary and forced preemption models.
While the mentioned races are theoretical (at least no one as reported
them), I wanted to make sure that this new logic doesn't break anything
we weren't aware of.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d9a605e4 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: rename ids->rw_mutex

Since in some situations the lock can be shared for readers, we shouldn't
be calling it a mutex, rename it to rwsem.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 758a6ba3 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: rename try_atomic_semop() to perform_atomic_semop(), docu update

Cleanup: Some minor points that I noticed while writing the previous
patches

1) The name try_atomic_semop() is misleading: The function performs the
operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d12e1e50 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: replace shared sem_otime with per-semaphore value

sem_otime contains the time of the last semaphore operation that
completed successfully. Every operation updates this value, thus access
from multiple cpus can cause thrashing.

Therefore the patch replaces the variable with a per-semaphore variable.
The per-array sem_otime is only calculated when required.

No performance improvement on a single-socket i3 - only important for
larger systems.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f269f40a 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: always use only one queue for alter operations

There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
priorization of the wakeups: complex operations have priority over
simple ops.

The patch restores the behavior of linux <=3.0.9: The longest waiting
operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.

As a side effect, do_smart_update_queue() becomes much simpler: no more
goto logic.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a82e9e1 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem: separate wait-for-zero and alter tasks into seperate queues

Introduce separate queues for operations that do not modify the
semaphore values. Advantages:

- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations are
always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f5c936c0 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: cacheline align the semaphore structures

As now each semaphore has its own spinlock and parallel operations are
possible, give each semaphore its own cacheline.

On a i3 laptop, this gives up to 28% better performance:

#semscale 10 | grep "interleave 2"
- before:
Cpus 1, interleave 2 delay 0: 36109234 in 10 secs
Cpus 2, interleave 2 delay 0: 55276317 in 10 secs
Cpus 3, interleave 2 delay 0: 62411025 in 10 secs
Cpus 4, interleave 2 delay 0: 81963928 in 10 secs

-after:
Cpus 1, interleave 2 delay 0: 35527306 in 10 secs
Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28%
Cpus 3, interleave 2 delay 0: 80518538 in 10 secs
Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7%

i3, with 2 cores and with hyperthreading enabled. Interleave 2 in order
use first the full cores. HT partially hides the delay from cacheline
trashing, thus the improvement is "only" 8.7% if 4 threads are running.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9ad66ae6 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: remove unused functions

We can now drop the msg_lock and msg_lock_check functions along with a
bogus comment introduced previously in semctl_down.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b4cc5d8 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: move locking out of ipcctl_pre_down_nolock

This function currently acquires both the rw_mutex and the rcu lock on
successful lookups, leaving the callers to explicitly unlock them,
creating another two level locking situation.

Make the callers (including those that still use ipcctl_pre_down())
explicitly lock and unlock the rwsem and rcu lock.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf9d5d78 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: close open coded spin lock calls

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ab465df9 26-May-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: Fix missing wakeups in do_smart_update_queue()

do_smart_update_queue() is called when an operation (semop,
semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check
which of the sleeping tasks can proceed.

do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue must be
scanned again

And:
- the test for "|sops == NULL) before scanning the global queue is not
required: If the global queue is empty, then it doesn't need to be
scanned - regardless of the reason for calling do_smart_update_queue()

The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# de2657f9 09-May-2013 Rik van Riel <riel@redhat.com>

ipc,sem: fix semctl(..., GETNCNT)

The semctl GETNCNT returns the number of semops waiting for the
specified semaphore to become nonzero. After commit 9f1bc2c9022c
("ipc,sem: have only one list in struct sem_queue"), the semops waiting
on just one semaphore are waiting on that semaphore's list.

In order to return the correct count, we have to walk that list too, in
addition to the sem_array's list for complex operations.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ebc2e5e6 09-May-2013 Rik van Riel <riel@redhat.com>

ipc,sem: fix semctl(..., GETZCNT)

The semctl GETZCNT returns the number of semops waiting for the
specified semaphore to become zero. After commit 9f1bc2c9022c
("ipc,sem: have only one list in struct sem_queue"), the semops waiting
on just one semaphore are waiting on that semaphore's list.

In order to return the correct count, we have to walk that list too, in
addition to the sem_array's list for complex operations.

This bug broke dbench; it works again with this patch applied.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Kent Overstreet <koverstreet@google.com>
Tested-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 941b0304 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: simplify rcu_read_lock() in semctl_nolock()

This trivially combines two rcu_read_lock() calls in both sides of a
if-statement into one single one in front of the if-statement.

Split out as an independent cleanup from the previous commit.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c728b9c8 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: simplify semtimedop/semctl_main() common error path handling

With various straight RCU lock/unlock movements, one common exit path
pattern had become

rcu_read_unlock();
goto out_wakeup;

and in fact there were no cases where we wanted to exit to out_wakeup
_without_ releasing the RCU read lock.

So replace that pattern with "goto out_rcu_wakeup", and remove the old
out_wakeup.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 321310ce 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move sem_obtain_lock() rcu locking into the only caller

sem_obtain_lock() was another of those functions that returned with the
RCU lock held for reading in the success case. Move the RCU locking to
the caller (semtimedop()), making it more obvious. We already did RCU
locking elsewhere in that function.

Side note: why does semtimedop() re-do the semphore lookup after the
sleep, rather than just getting a reference to the semaphore it already
looked up originally?

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fbfd1d28 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: fix double sem unlock in semctl error path

Fix another ipc locking buglet introduced by the scalability patches:
when semctl_down() was changed to delay the semaphore locking, one error
path for security_sem_semctl() went through the semaphore unlock logic
even though the semaphore had never been locked.

Introduced by commit 16df3674efe3 ("ipc,sem: do not hold ipc lock more
than necessary")

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4091fd94 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move the rcu_read_lock() from sem_lock_and_putref() into callers

This is another ipc semaphore locking cleanup, trying to make the
locking more straightforward. We move the rcu read locking into the
callers of sem_lock_and_putref(), which in general means that we now
mostly do the rcu_read_lock() and rcu_read_unlock() in the same
function.

Mostly. We still have the ipc_addid/newary/freeary mess, and things
like ipcctl_pre_down_nolock().

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 73b29505 03-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: sem_putref() does not need the semaphore lock any more

ipc_rcu_putref() uses atomics for the refcount, and the games to lock
and unlock the semaphore just to try to keep the reference counting
working are no longer useful.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d49dab8 03-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move rcu_read_unlock() out of sem_unlock() and into callers

The IPC locking is a mess, and sem_unlock() unlocks not only the
semaphore spinlock, it also drops the rcu read lock. Unlike sem_lock(),
which just gets the spin-lock, and expects the caller to get the rcu
read lock.

This all makes things very hard to follow, and it's very confusing when
you take the rcu read lock in one function, and then release it in
another. And it has caused actual bugs: the sem_obtain_lock() function
ended up dropping the RCU read lock twice in one error path, because it
first did the sem_unlock(), and then did a rcu_read_unlock() to match
the rcu_read_lock() it had done.

This is just a totally mindless "remove rcu_read_unlock() from
sem_unlock() and add it immediately after each caller" (except for the
aforementioned bug where we did too many rcu_read_unlock(), and in
find_alloc_undo() where we just got the rcu_read_lock() to correct for
the fact that sem_unlock would immediately drop it again).

We can (and should) clean things up further, but this fixes the bug with
the minimal amount of subtlety.

Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce857229 02-May-2013 Al Viro <viro@ZenIV.linux.org.uk>

ipc: fix GETALL/IPC_RM race for sysv semaphores

We can step on WARN_ON_ONCE() in sem_getref() if a semaphore is removed
just as we are about to call sem_getref() from semctl_main(); results
are not pretty.

We should fail with -EIDRM, same as if IPC_RM happened while we'd been
doing allocation there. This also expands sem_getref() at its only
callsite (and fixed there), while sem_getref_and_unlock() is simply
killed off - it has no callers at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6062a8dc 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: fine grained locking for semtimedop

Introduce finer grained locking for semtimedop, to handle the common case
of a program wanting to manipulate one semaphore from an array with
multiple semaphores.

If the call is a semop manipulating just one semaphore in an array with
multiple semaphores, only take the lock for that semaphore itself.

If the call needs to manipulate multiple semaphores, or another caller is
in a transaction that manipulates multiple semaphores, the sem_array lock
is taken, as well as all the locks for the individual semaphores.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159

[davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
[davidlohr.bueso@hp.com: make refcounter atomic]
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Emmanuel Benisty <benisty.e@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f1bc2c9 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: have only one list in struct sem_queue

Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows us to
introduce finer grained locking for semtimedop.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c460b662 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: open code and rename sem_lock

Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
later that only locks the sem_array and does nothing else.

Open code the locking from ipc_lock() in sem_obtain_lock() so we can
introduce finer grained locking for the sem_array in the next patch.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 16df3674 30-Apr-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,sem: do not hold ipc lock more than necessary

Instead of holding the ipc lock for permissions and security checks, among
others, only acquire it when necessary.

Some numbers....

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+ 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.86% a.out [kernel.kallsyms] [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+ 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check

2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%

[riel@redhat.com: fix two locking bugs]
[sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e1fd1f49 05-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of union semop in sys_semctl(2) arguments

just have the bugger take unsigned long and deal with SETVAL
case (when we use an int member in the union) explicitly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 22d1a35d 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

make HAVE_SYSCALL_WRAPPERS unconditional

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1efdb69b 07-Feb-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Convert ipc to use kuid and kgid where appropriate

- Store the ipc owner and creator with a kuid
- Store the ipc group and the crators group with a kgid.
- Add error handling to ipc_update_perms, allowing it to
fail if the uids and gids can not be converted to kuids
or kgids.
- Modify the proc files to display the ipc creator and
owner in the user namespace of the opener of the proc file.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# e57940d7 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove private structures from public header file

include/linux/sem.h contains several structures that are only used within
ipc/sem.c.

The patch moves them into ipc/sem.c - there is no need to expose the
structures to the whole kernel.

No functional changes, only whitespace cleanups and 80-char per line
fixes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b0577f6 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: handle spurious wakeups

semtimedop() does not handle spurious wakeups, it returns -EINTR to user
space. Most other schedule() users would just loop and not return to user
space. The patch adds such a loop to semtimedop()

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3c24783b 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)

sys_semtimedop() may return -EIDRM although the semaphore operation
completed successfully:

thread 1: thread 2:
semtimedop(), sleeps
semop():
* acquires sem_lock()
semtimedop() woken up due to timeout
sem_lock() loops
* notices that thread 2 could be completed.
* performs the operations that thread 2 is sleeping on.
* marks the semaphore operation as IN_WAKEUP
* drops sem_lock(), does wakeup, sets return code to 0
* thread delayed due to interrupt, whatever
* returns to user space
* thread still delayed
semctl(IPC_RMID)
* acquires sem_lock()
* ipc_rmid(), ipcp->deleted=1
* drops sem_lock()
* thread finally continues - but seem_lock()
now fails due to ipcp->deleted == 1
* returns -EIDRM instead of 0

The fix is trivial: Always use the return code in queue.status.

In real world, the race probably doesn't matter:
If the semaphore array is destroyed, the app is probably not interested
if the last operation succeeded or was already cancelled.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d694ad62 25-Jul-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMID

If a semaphore array is removed and in parallel a sleeping task is woken
up (signal or timeout, does not matter), then the woken up task does not
wait until wake_up_sem_queue_do() is completed. This will cause crashes,
because wake_up_sem_queue_do() will read from a stale pointer.

The fix is simple: Regardless of anything, always call get_queue_result().
This function waits until wake_up_sem_queue_do() has finished it's task.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=27142

Reported-by: Yuriy Yevtukhov <yuriy@ucoz.com>
Reported-by: Harald Laabs <kernel@dasr.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org> [2.6.35+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 693a8b6e 17-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com>

ipc,rcu: Convert call_rcu(free_un) to kfree_rcu()

The rcu callback free_un() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_un).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# b0e77598 23-Mar-2011 Serge E. Hallyn <serge@hallyn.com>

userns: user namespaces: convert several capable() calls

CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid are to uids in own namespace, so again checks can be against
current_user_ns().

Changelog:
Jan 11: Use task_ns_capable() in place of sched_capable().
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Clarify (hopefully) some logic in futex and sched.c
Feb 15: use ns_capable for ipc, not nsown_capable
Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
Feb 23: pass ns down rather than taking it from current

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 982f7c2b 30-Sep-2010 Dan Rosenberg <drosenberg@vsecurity.com>

sys_semctl: fix kernel stack leakage

The semctl syscall has several code paths that lead to the leakage of
uninitialized kernel stack memory (namely the IPC_INFO, SEM_INFO,
IPC_STAT, and SEM_STAT commands) during the use of the older, obsolete
version of the semid_ds struct.

The copy_semid_to_user() function declares a semid_ds struct on the stack
and copies it back to the user without initializing or zeroing the
"sem_base", "sem_pending", "sem_pending_last", and "undo" pointers,
allowing the leakage of 16 bytes of kernel stack memory.

The code is still reachable on 32-bit systems - when calling semctl()
newer glibc's automatically OR the IPC command with the IPC_64 flag, but
invoking the syscall directly allows users to use the older versions of
the struct.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c61284e9 20-Jul-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: bugfix for semop() not reporting successful operation

The last change to improve the scalability moved the actual wake-up out of
the section that is protected by spin_lock(sma->sem_perm.lock).

This means that IN_WAKEUP can be in queue.status even when the spinlock is
acquired by the current task. Thus the same loop that is performed when
queue.status is read without the spinlock acquired must be performed when
the spinlock is acquired.

Thanks to kamezawa.hiroyu@jp.fujitsu.com for noticing lack of the memory
barrier.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255

[akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Luca Tettamanti <kronos.it@gmail.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
Reported-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4de85cd6 26-May-2010 Julia Lawall <julia@diku.dk>

ipc/sem.c: use ERR_CAST

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c5cf6359 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update description of the implementation

ipc/sem.c begins with a 15 year old description about bugs in the initial
implementation in Linux-1.0. The patch replaces that with a top level
description of the current code.

A TODO could be derived from this text:

The opengroup man page for semop() does not mandate FIFO. Thus there is
no need for a semaphore array list of pending operations.

If

- this list is removed
- the per-semaphore array spinlock is removed (possible if there is no
list to protect)
- sem_otime is moved into the semaphores and calculated on demand during
semctl()

then the array would be read-mostly - which would significantly improve
scaling for applications that use semaphore arrays with lots of entries.

The price would be expensive semctl() calls:

for(i=0;i<sma->sem_nsems;i++) spin_lock(sma->sem_lock);
<do stuff>
for(i=0;i<sma->sem_nsems;i++) spin_unlock(sma->sem_lock);

I'm not sure if the complexity is worth the effort, thus here is the
documentation of the current behavior first.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a2b9d4c 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: move wake_up_process out of the spinlock section

The wake-up part of semtimedop() consists out of two steps:

- the right tasks must be identified.
- they must be woken up.

Right now, both steps run while the array spinlock is held. This patch
reorders the code and moves the actual wake_up_process() behind the point
where the spinlock is dropped.

The code also moves setting sem->sem_otime to one place: It does not make
sense to set the last modify time multiple times.

[akpm@linux-foundation.org: repair kerneldoc]
[akpm@linux-foundation.org: fix uninitialised retval]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd5db422 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize update_queue() for bulk wakeup calls

The following series of patches tries to fix the spinlock contention
reported by Chris Mason - his benchmark exposes problems of the current
code:

- In the worst case, the algorithm used by update_queue() is O(N^2).
Bulk wake-up calls can enter this worst case. The patch series fix
that.

Note that the benchmark app doesn't expose the problem, it just should
be fixed: Real world apps might do the wake-ups in another order than
perfect FIFO.

- The part of the code that runs within the semaphore array spinlock is
significantly larger than necessary.

The patch series fixes that. This change is responsible for the main
improvement.

- The cacheline with the spinlock is also used for a variable that is
read in the hot path (sem_base) and for a variable that is unnecessarily
written to multiple times (sem_otime). The last step of the series
cacheline-aligns the spinlock.

This patch:

The SysV semaphore code allows to perform multiple operations on all
semaphores in the array as atomic operations. After a modification,
update_queue() checks which of the waiting tasks can complete.

The algorithm that is used to identify the tasks is O(N^2) in the worst
case. For some cases, it is simple to avoid the O(N^2).

The patch adds a detection logic for some cases, especially for the case
of an array where all sleeping tasks are single sembuf operations and a
multi-sembuf operation is used to wake up multiple tasks.

A big database application uses that approach.

The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of
the patch breaks that.

[akpm@linux-foundation.org: make do_smart_update() static]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5cc9c7b 15-Dec-2009 Amerigo Wang <amwang@redhat.com>

ipc: remove unreachable code in sem.c

This line is unreachable, remove it.

[akpm@linux-foundation.org: remove unneeded initialisation of `err']
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d987f8b2 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize single sops when semval is zero

If multiple simple decrements on the same semaphore are pending, then the
current code scans all decrement operations, even if the semaphore value
is already 0.

The patch optimizes that: if the semaphore value is 0, then there is no
need to scan the q->alter entries.

Note that this is a common case: It happens if 100 decrements by one are
pending and now an increment by one increases the semaphore value from 0
to 1. Without this patch, all 100 entries are scanned. With the patch,
only one entry is scanned, then woken up. Then the new rule triggers and
the scanning is aborted, without looking at the remaining 99 tasks.

With this patch, single sop increment/decrement by 1 are now O(1).
(same as with Nick's patch)

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 636c6be8 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize single semop operations

sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.

The patch optimizes single semaphore operation calls that affect only one
semaphore: It's not necessary to scan all pending operations, it is
sufficient to scan the per-semaphore list.

The idea is from Nick Piggin version of an ipc sem improvement, the
implementation is different: The code tries to keep as much common code as
possible.

As the result, the patch is simpler, but optimizes fewer cases.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b97e820f 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: add a per-semaphore pending list

Based on Nick's findings:

sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.

The patch is the first step for optimizing simple, single semaphore
operations: In addition to the global list of all pending operations, a
2nd, per-semaphore list with the simple operations is added.

Note: this patch does not make sense by itself, the new list is used
nowhere.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b6e90822 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize if semops fail

Reduce the amount of scanning of the list of pending semaphore operations:
If try_atomic_semop failed, then no changes were applied. Thus no need to
restart.

Additionally, this patch correct an incorrect comment: It's possible to
wait for arbitrary semaphore values (do a dec by <x>, wait-for-zero, inc
by <x> in one atomic operation)

Both changes are from Nick Piggin, the patch is the result of a different
split of the individual changes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d4212093 15-Dec-2009 Nick Piggin <npiggin@suse.de>

ipc/sem.c: sem preempt improve

The strange sysv semaphore wakeup scheme has a kind of busy-wait lock
involved, which could deadlock if preemption is enabled during the "lock".

It is an implementation detail (due to a spinlock being held) that this is
actually the case. However if "spinlocks" are made preemptible, or if the
sem lock is changed to a sleeping lock for example, then the wakeup would
become buggy. So this might be a bugfix for -rt kernels.

Imagine waker being preempted by wakee and never clearing IN_WAKEUP -- if
wakee has higher RT priority then there is a priority inversion deadlock.
Even if there is not a priority inversion to cause a deadlock, then there
is still time wasted spinning.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9cad200c 15-Dec-2009 Nick Piggin <npiggin@suse.de>

ipc/sem.c: sem use list operations

Replace the handcoded list operations in update_queue() with the standard
list_for_each_entry macros.

list_for_each_entry_safe() must be used, because list entries can
disappear immediately uppon the wakeup event.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bf17bb71 15-Dec-2009 Nick Piggin <npiggin@suse.de>

ipc/sem.c: sem optimise undo list search

Around a month ago, there was some discussion about an improvement of the
sysv sem algorithm: Most (at least: some important) users only use simple
semaphore operations, therefore it's worthwile to optimize this use case.

This patch:

Move last looked up sem_undo struct to the head of the task's undo list.
Attempt to move common entries to the front of the list so search time is
reduced. This reduces lookup_undo on oprofile of problematic SAP workload
by 30% (see patch 4 for a description of SAP workload).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d6feeb2 15-Dec-2009 Serge E. Hallyn <serue@us.ibm.com>

ipc ns: fix memory leak (idr)

We have apparently had a memory leak since
7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in
2007. The idr of which 3 exist for each ipc namespace is never freed.

This patch simply frees them when the ipcns is freed. I don't believe any
idr_remove() are done from rcu (and could therefore be delayed until after
this idr_destroy()), so the patch should be safe. Some quick testing
showed no harm, and the memory leak fixed.

Caught by kmemleak.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 05725f7e 14-Apr-2009 Jiri Pirko <jpirko@redhat.com>

rculist: use list_entry_rcu in places where it's appropriate

Use previously introduced list_entry_rcu instead of an open-coded
list_entry + rcu_dereference combination.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: dipankar@in.ibm.com
LKML-Reference: <20090414181715.GA3634@psychotron.englab.brq.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# d5460c99 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 25

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 6673e0c3 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrapper special cases

System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.

So we handle them as special case and add some extra wrappers instead.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# e953ac21 06-Jan-2009 Denis V. Lunev <den@openvz.org>

ipc: do not goto to the next line

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: WANG Cong <wangcong@zeuux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 046c6884 05-Jan-2009 Alan Cox <alan@lxorguk.ukuu.org.uk>

mm: update my address

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d97e234 15-Oct-2008 Adrian Bunk <bunk@kernel.org>

ipc/sem.c: make free_un() static

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 380af1b3 25-Jul-2008 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: rewrite undo list locking

The attached patch:
- reverses the locking order of ulp->lock and sem_lock:
Previously, it was first ulp->lock, then inside sem_lock.
Now it's the other way around.
- converts the undo structure to rcu.

Benefits:
- With the old locking order, IPC_RMID could not kfree the undo structures.
The stale entries remained in the linked lists and were released later.
- The patch fixes a a race in semtimedop(): if both IPC_RMID and a semget() that
recreates exactly the same id happen between find_alloc_undo() and sem_lock,
then semtimedop() would access already kfree'd memory.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1193f8e 25-Jul-2008 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: convert sem_array.sem_pending to struct list_head

sem_array.sem_pending is a double linked list, the attached patch converts
it to struct list_head.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2c0c29d4 25-Jul-2008 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove unused entries from struct sem_queue

sem_queue.sma and sem_queue.id were never used, the attached patch removes
them.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4daa28f6 25-Jul-2008 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: convert undo structures to struct list_head

The undo structures contain two linked lists, the attached patch replaces
them with generic struct list_head lists.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9edff4ab 29-Apr-2008 Manfred Spraul <manfred@colorfullife.com>

ipc: sysvsem: implement sys_unshare(CLONE_SYSVSEM)

sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.

Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)

The original reason to not support it was the potential (inevitable?)
confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
inverse meaning of clone(CLONE_SYSVSEM).

Our two most reasonable options then appear to be (1) fully support
CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
(1).

Changelog:
Apr 16: SEH: switch to Manfred's alternative patch which
removes the unshare_semundo() function which
always refused CLONE_SYSVSEM.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5f75e7f 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: consolidate all xxxctl_down() functions

semctl_down(), msgctl_down() and shmctl_down() are used to handle the same set
of commands for each kind of IPC. They all start to do the same job (they
retrieve the ipc and do some permission checks) before handling the commands
on their own.

This patch proposes to consolidate this by moving these same pieces of code
into one common function called ipcctl_pre_down().

It simplifies a little these xxxctl_down() functions and increases a little
the maintainability.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f4a3809 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: introduce ipc_update_perm()

The IPC_SET command performs the same permission setting for all IPCs. This
patch introduces a common ipc_update_perm() function to update these
permissions and makes use of it for all IPCs.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 016d7132 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: get rid of the use *_setbuf structure.

All IPCs make use of an intermetiate *_setbuf structure to handle the IPC_SET
command. This is not really needed and, moreover, it complicates a little bit
the code.

This patch gets rid of the use of it and uses directly the semid64_ds/
msgid64_ds/shmid64_ds structure.

In addition of removing one struture declaration, it also simplifies and
improves a little bit the common 64-bits path.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21a4826a 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC/semaphores: remove one unused parameter from semctl_down()

semctl_down() takes one unused parameter: semnum. This patch proposes to get
rid of it.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 522bb2a2 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC/semaphores: move the rwmutex handling inside semctl_down

semctl_down is called with the rwmutex (the one which protects the list of
ipcs) taken in write mode.

This patch moves this rwmutex taken in write-mode inside semctl_down.

This has the advantages of reducing a little bit the window during which this
rwmutex is taken, clarifying sys_semctl, and finally of having a coherent
behaviour with [shm|msg]ctl_down

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6ff37972 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC/semaphores: code factorisation

Trivial patch which adds some small locking functions and makes use of them to
factorize some part of the code and to make it cleaner.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 48dea404 29-Apr-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: use ipc_buildid() directly from ipc_addid()

By continuing to consolidate a little the IPC code, each id can be built
directly in ipc_addid() instead of having it built from each callers of
ipc_addid()

And I also remove shm_addid() in order to have, as much as possible, the
same code for shm/sem/msg.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01b8b07a 08-Feb-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: consolidate sem_exit_ns(), msg_exit_ns() and shm_exit_ns()

sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
ipc_namespace is released to free all ipcs of each type. But in fact, they
do the same thing: they loop around all ipcs to free them individually by
calling a specific routine.

This patch proposes to consolidate this by introducing a common function,
free_ipcs(), that do the job. The specific routine to call on each
individual ipcs is passed as parameter. For this, these ipc-specific
'free' routines are reworked to take a generic 'struct ipc_perm' as
parameter.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed2ddbf8 08-Feb-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: make struct ipc_ids static in ipc_namespace

Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (3 for
msg, sem and shm, structure used to store all ipcs) These 'struct ipc_ids'
are dynamically allocated for each icp_namespace as the ipc_namespace
itself (for the init namespace, they are initialized with pointers to
static variables instead)

It is so for historical reason: in fact, before the use of idr to store the
ipcs, the ipcs were stored in tables of variable length, depending of the
maximum number of ipc allowed. Now, these 'struct ipc_ids' have a fixed
size. As they are allocated in any cases for each new ipc_namespace, there
is no gain of memory in having them allocated separately of the struct
ipc_namespace.

This patch proposes to make this table static in the struct ipc_namespace.
Thus, we can allocate all in once and get rid of all the code needed to
allocate and free these ipc_ids separately.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b9fcb0e 08-Feb-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC/semaphores: consolidate SEM_STAT and IPC_STAT commands

These commands (SEM_STAT and IPC_STAT) are rather doing the same things
(only the meaning of the id given as input and the return value differ).
However, for the semaphores, they are handled in two different places (two
different functions).

This patch consolidates this for clarification by handling these both
commands in the same place in semctl_nolock(). It also removes one unused
parameter for this function.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae5e1b22 08-Feb-2008 Pavel Emelyanov <xemul@openvz.org>

namespaces: move the IPC namespace under IPC_NS option

Currently the IPC namespace management code is spread over the ipc/*.c files.
I moved this code into ipc/namespace.c file which is compiled out when needed.

The linux/ipc_namespace.h file is used to store the prototypes of the
functions in namespace.c and the stubs for NAMESPACES=n case. This is done
so, because the stub for copy_ipc_namespace requires the knowledge of the
CLONE_NEWIPC flag, which is in sched.h. But the linux/ipc.h file itself in
included into many many .c files via the sys.h->sem.h sequence so adding the
sched.h into it will make all these .c depend on sched.h which is not that
good. On the other hand the knowledge about the namespaces stuff is required
in 4 .c files only.

Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
msg.c and shm.c files. It turned out that moving these functions into
namespaces.c is not that easy because they use many other calls and macros
from the original file. Moving them would make this patch complicated. On
the other hand all these functions can be consolidated, so I will send a
separate patch doing this a bit later.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b1ed88b4 06-Feb-2008 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: fix error check in all new xxx_lock() and xxx_exit_ns() functions

In the new implementation of the [sem|shm|msg]_lock[_check]() routines, we
use the return value of ipc_lock() in container_of() without any check.
But ipc_lock may return a errcode. The use of this errcode in
container_of() may alter this errcode, and we don't want this.

And in xxx_exit_ns, the pointer return by idr_find is of type 'struct
kern_ipc_per'...

Today, the code will work as is because the member used in these
container_of() is the first member of its container (offset == 0), the
errcode isn't changed then. But in the general case, we can't count on
this assumption and this may lead later to a real bug if we don't correct
this.

Again, the proposed solution is simple and correct. But, as pointed by
Nadia, with this solution, the same check will be done several times (in
all sub-callers...), what is not very funny/optimal...

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 283bb7fa 19-Oct-2007 Pierre Peiffer <Pierre.Peiffer@bull.net>

IPC: fix error case when idr-cache is empty in ipcget()

With the use of idr to store the ipc, the case where the idr cache is
empty, when idr_get_new is called (this may happen even if we call
idr_pre_get() before), is not well handled: it lets
semget()/shmget()/msgget() return ENOSPC when this cache is empty, what 1.
does not reflect the facts and 2. does not conform to the man(s).

This patch fixes this by retrying the whole process of allocation in this case.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c530c6ac 19-Oct-2007 Pierre Peiffer <pierre.peiffer@bull.net>

IPC: cleanup some code and wrong comments about semundo list managment

Some comments about sem_undo_list seem wrong.
About the comment above unlock_semundo:
"... If task2 now exits before task1 releases the lock (by calling
unlock_semundo()), then task1 will never call spin_unlock(). ..."

This is just wrong, I see no reason for which task1 will not call
spin_unlock... The rest of this comment is also wrong... Unless I
miss something (of course).

Finally, (un)lock_semundo functions are useless, so remove them
for simplification. (this avoids an useless if statement)

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b531f21 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: remove unneeded parameters

Remvoe the unneeded parameters from ipc_checkid() and ipc_buildid()
interfaces.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e148c79 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

fix idr_find() locking

This is a patch that fixes the way idr_find() used to be called in ipc_lock():
in all the paths that don't imply an update of the ipcs idr, it was called
without the idr tree being locked.

The changes are:
. in ipc_ids, the mutex has been changed into a reader/writer semaphore.
. ipc_lock() now takes the mutex as a reader during the idr_find().
. a new routine ipc_lock_down() has been defined: it doesn't take the
mutex, assuming that it is being held by the caller. This is the routine
that is now called in all the update paths.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4566f04 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: fix wrong comments

This patch fixes the wrong / obsolete comments in the ipc code. Also adds
a missing lock around ipc_get_maxid() in shm_get_stat().

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03f02c76 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

Storing ipcs into IDRs

This patch converts casts of struct kern_ipc_perm to
. struct msg_queue
. struct sem_array
. struct shmid_kernel
into the equivalent container_of() macro. It improves code maintenance
because the code need not change if kern_ipc_perm is no longer at the
beginning of the containing struct.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 023a5355 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: integrate ipc_checkid() into ipc_lock()

This patch introduces a new ipc_lock_check() routine interface:
. each time ipc_checkid() is called, this is done after calling ipc_lock().
ipc_checkid() is now called from inside ipc_lock_check().

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix RCU locking]
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7748dbfa 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: unify the syscalls code

This patch introduces a change into the sys_msgget(), sys_semget() and
sys_shmget() routines: they now share a common code, which is better for
maintainability.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ca7e564 19-Oct-2007 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: store ipcs into IDRs

This patch introduces ipcs storage into IDRs. The main changes are:
. This ipc_ids structure is changed: the entries array is changed into a
root idr structure.
. The grow_ary() routine is removed: it is not needed anymore when adding
an ipc structure, since we are now using the IDR facility.
. The ipc_rmid() routine interface is changed:
. there is no need for this routine to return the pointer passed in as
argument: it is now declared as a void
. since the id is now part of the kern_ipc_perm structure, no need to
have it as an argument to the routine

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b488893a 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

pid namespaces: changes to show virtual ids to user

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e1c091c 17-Jul-2007 Jeff Garzik <jeff@garzik.org>

arch/i386/* fs/* ipc/*: mark variables with uninitialized_var()

Mark variables with uninitialized_var() if such a warning appears,
and analysis proves that the var is initialized properly on all paths
it is used.

Signed-off-by: Jeff Garzik <jeff@garzik.org>


# 7d69a1f4 16-Jul-2007 Cedric Le Goater <clg@fr.ibm.com>

remove CONFIG_UTS_NS and CONFIG_IPC_NS

CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only
deactivate the unshare of the uts and ipc namespaces and do not improve
performance.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e63340ae 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com>

header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4668edc3 06-Dec-2006 Burman Yan <yan_952@hotmail.com>

[PATCH] kernel core: replace kmalloc+memset with kzalloc

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c7e12b83 02-Nov-2006 Pavel Emelianov <xemul@openvz.org>

[PATCH] Fix ipc entries removal

Fix two issuses related to ipc_ids->entries freeing.

1. When freeing ipc namespace we need to free entries allocated
with ipc_init_ids().

2. When removing old entries in grow_ary() ipc_rcu_putref()
may be called on entries set to &ids->nullentry earlier in
ipc_init_ids().
This is almost impossible without namespaces, but with
them this situation becomes possible.

Found during OpenVZ testing after obvious leaks in beancounters.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2453a306 02-Oct-2006 Matt Helsley <matthltc@us.ibm.com>

[PATCH] ipc: replace kmalloc and memset in get_undo_list with kzalloc

Simplify get_undo_list() by dropping the unnecessary cast, removing the
size variable, and switching to kzalloc() instead of a kmalloc() followed
by a memset().

This cleanup was split then modified from Jes Sorenson's Task Notifiers
patches.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e3893534 02-Oct-2006 Kirill Korotaev <dev@openvz.org>

[PATCH] IPC namespace - sem

IPC namespace support for IPC sem code.

Signed-off-by: Pavel Emelianiov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# ac03221a 16-May-2006 Linda Knippers <linda.knippers@hp.com>

[PATCH] update of IPC audit record cleanup

The following patch addresses most of the issues with the IPC_SET_PERM
records as described in:
https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html
and addresses the comments I received on the record field names.

To summarize, I made the following changes:

1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM
record is emitted in the failure case as well as the success case.
This matches the behavior in sys_shmctl(). I could simplify the
code in sys_msgctl() and semctl_down() slightly but it would mean
that in some error cases we could get an IPC_SET_PERM record
without an IPC record and that seemed odd.

2. No change to the IPC record type, given no feedback on the backward
compatibility question.

3. Removed the qbytes field from the IPC record. It wasn't being
set and when audit_ipc_obj() is called from ipcperms(), the
information isn't available. If we want the information in the IPC
record, more extensive changes will be necessary. Since it only
applies to message queues and it isn't really permission related, it
doesn't seem worth it.

4. Removed the obj field from the IPC_SET_PERM record. This means that
the kern_ipc_perm argument is no longer needed.

5. Removed the spaces and renamed the IPC_SET_PERM field names. Replaced iuid and
igid fields with ouid and ogid in the IPC record.

I tested this with the lspp.22 kernel on an x86_64 box. I believe it
applies cleanly on the latest kernel.

-- ljk

Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 073115d6 02-Apr-2006 Steve Grubb <sgrubb@redhat.com>

[PATCH] Rework of IPC auditing

1) The audit_ipc_perms() function has been split into two different
functions:
- audit_ipc_obj()
- audit_ipc_set_perm()

There's a key shift here... The audit_ipc_obj() collects the uid, gid,
mode, and SElinux context label of the current ipc object. This
audit_ipc_obj() hook is now found in several places. Most notably, it
is hooked in ipcperms(), which is called in various places around the
ipc code permforming a MAC check. Additionally there are several places
where *checkid() is used to validate that an operation is being
performed on a valid object while not necessarily having a nearby
ipcperms() call. In these locations, audit_ipc_obj() is called to
ensure that the information is captured by the audit system.

The audit_set_new_perm() function is called any time the permissions on
the ipc object changes. In this case, the NEW permissions are recorded
(and note that an audit_ipc_obj() call exists just a few lines before
each instance).

2) Support for an AUDIT_IPC_SET_PERM audit message type. This allows
for separate auxiliary audit records for normal operations on an IPC
object and permissions changes. Note that the same struct
audit_aux_data_ipcctl is used and populated, however there are separate
audit_log_format statements based on the type of the message. Finally,
the AUDIT_IPC block of code in audit_free_aux() was extended to handle
aux messages of this new type. No more mem leaks I hope ;-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5f921ae9 26-Mar-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] sem2mutex: ipc, id.sem

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 27315c96 26-Mar-2006 Eric Sesterhenn <snakebyte@gmx.de>

BUG_ON() Conversion in ipc/sem.c

this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 8c8570fb 03-Nov-2005 Dustin Kirkland <dustin.kirkland@us.ibm.com>

[PATCH] Capture selinux subject/object context information.

This patch extends existing audit records with subject/object context
information. Audit records associated with filesystem inodes, ipc, and
tasks now contain SELinux label information in the field "subj" if the
item is performing the action, or in "obj" if the item is the receiver
of an action.

These labels are collected via hooks in SELinux and appended to the
appropriate record in the audit code.

This additional information is required for Common Criteria Labeled
Security Protection Profile (LSPP).

[AV: fixed kmalloc flags use]
[folded leak fixes]
[folded cleanup from akpm (kfree(NULL)]
[folded audit_inode_context() leak fix]
[folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT]

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 624dffcb 14-Jan-2006 Christian Kujau <evil@g-house.de>

correct email address of Manfred Spraul

I tried to send the forcedeth maintainer an email, but it came back with:

"The mail address manfreds@colorfullife.com is not read anymore.
Please resent your mail to manfred@ instead of manfreds@."

This patch fixes this.

Signed-off-by: Adrian Bunk <bunk@stusta.de>


# c59ede7b 11-Jan-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] move capable() to capability.h

- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
(in include/, block/, ipc/, kernel/, a few drivers/,
mm/, security/, & sound/;
many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1224b375 24-Dec-2005 Linus Torvalds <torvalds@g5.osdl.org>

Fix silly typo ("smb" vs "smp")

Introduced by commit 6003a93e7bf6c02f33c02976ff364785d4273295


# 6003a93e 23-Dec-2005 Manfred Spraul <manfred@colorfullife.com>

[PATCH] add missing memory barriers to ipc/sem.c

Two smp_wmb() statements are missing in the sysv sem code: This could
cause stack corruptions.

The attached patch adds them.

Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 19b4946c 06-Sep-2005 Mike Waychison <mikew@google.com>

[PATCH] ipc: convert /proc/sysvipc/* to generic seq_file interface

Change the /proc/sysvipc/shm|sem|msg files to use the generic seq_file
implementation for struct ipc_ids.

Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 00a5dfdb 05-Aug-2005 Ingo Molnar <mingo@elte.hu>

[PATCH] Fix semundo lock leakage

semundo->lock can leak if semundo->refcount goes from 2 to 1 while
another thread has it locked. This causes major problems for PREEMPT
kernels.

The simplest fix for now is to undo the single-thread optimization.

This bug was found via relentless testing by Dominik Karall.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b78755ab 23-Jun-2005 Manfred Spraul <manfred@colorfullife.com>

[PATCH] ipcsem: remove superflous decrease variable from sys_semtimedop

Patrick noticed that the initial scan of the semaphore operations logs
decrease and increase operations seperately, but then both cases are or'ed
together and decrease is never used. The attached patch removes the
decrease parameter - it shrinks sys_semtimedop() by 56 bytes.

Signed-Of-By: Manfred Spraul <manfred@colorfullife.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!