History log of /linux-master/ipc/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
6a4746ba 11-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

ipc: remove memcg accounting for sops objects in do_semtimedop()

Linus proposes to revert an accounting for sops objects in
do_semtimedop() because it's really just a temporary buffer
for a single semtimedop() system call.

This object can consume up to 2 pages, syscall is sleeping
one, size and duration can be controlled by user, and this
allocation can be repeated by many thread at the same time.

However Shakeel Butt pointed that there are much more popular
objects with the same life time and similar memory
consumption, the accounting of which was decided to be
rejected for performance reasons.

Considering at least 2 pages for task_struct and 2 pages for
the kernel stack, a back of the envelope calculation gives a
footprint amplification of <1.5 so this temporal buffer can be
safely ignored.

The factor would IMO be interesting if it was >> 2 (from the
PoV of excessive (ab)use, fine-grained accounting seems to be
currently unfeasible due to performance impact).

Link: https://lore.kernel.org/lkml/90e254df-0dfe-f080-011e-b7c53ee7fd20@virtuozzo.com/
Fixes: 18319498fdd4 ("memcg: enable accounting of ipc resources")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

35776f10 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM development updates from Russell King:

- Rename "mod_init" and "mod_exit" so that initcall debug output is
actually useful (Randy Dunlap)

- Update maintainers entries for linux-arm-kernel to indicate it is
moderated for non-subscribers (Randy Dunlap)

- Move install rules to arch/arm/Makefile (Masahiro Yamada)

- Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

- Don't warn about atags_to_fdt() stack size (David Heidelberg)

- Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

- Get rid of set_fs() usage (Arnd Bergmann)

- Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
ARM: 9116/1: unified: Remove check for gcc < 4
ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
ARM: 9113/1: uaccess: remove set_fs() implementation
ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
ARM: 9111/1: oabi-compat: rework fcntl64() emulation
ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
ARM: 9107/1: syscall: always store thread_info->abi_syscall
ARM: 9109/1: oabi-compat: add epoll_pwait handler
ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
ARM: 9105/1: atags_to_fdt: don't warn about stack size
ARM: 9103/1: Drop ARCH_NR_GPIOS definition
ARM: 9102/1: move theinstall rules to arch/arm/Makefile
ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific


2d338201 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...


20401d10 07-Sep-2021 Rafael Aquini <aquini@redhat.com>

ipc: replace costly bailout check in sysvipc_find_ipc()

sysvipc_find_ipc() was left with a costly way to check if the offset
position fed to it is bigger than the total number of IPC IDs in use. So
much so that the time it takes to iterate over /proc/sysvipc/* files grows
exponentially for a custom benchmark that creates "N" SYSV shm segments
and then times the read of /proc/sysvipc/shm (milliseconds):

12 msecs to read 1024 segs from /proc/sysvipc/shm
18 msecs to read 2048 segs from /proc/sysvipc/shm
65 msecs to read 4096 segs from /proc/sysvipc/shm
325 msecs to read 8192 segs from /proc/sysvipc/shm
1303 msecs to read 16384 segs from /proc/sysvipc/shm
5182 msecs to read 32768 segs from /proc/sysvipc/shm

The root problem lies with the loop that computes the total amount of ids
in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than
"ids->in_use". That is a quite inneficient way to get to the maximum
index in the id lookup table, specially when that value is already
provided by struct ipc_ids.max_idx.

This patch follows up on the optimization introduced via commit
15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the
aforementioned costly loop replacing it by a simpler checkpoint based on
ipc_get_maxidx() returned value, which allows for a smooth linear increase
in time complexity for the same custom benchmark:

2 msecs to read 1024 segs from /proc/sysvipc/shm
2 msecs to read 2048 segs from /proc/sysvipc/shm
4 msecs to read 4096 segs from /proc/sysvipc/shm
9 msecs to read 8192 segs from /proc/sysvipc/shm
19 msecs to read 16384 segs from /proc/sysvipc/shm
39 msecs to read 32768 segs from /proc/sysvipc/shm

Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

18319498 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting of ipc resources

When user creates IPC objects it forces kernel to allocate memory for
these long-living objects.

It makes sense to account them to restrict the host's memory consumption
from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages
semaphores and semaphore's undo lists.

Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

30acd0bd 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for new namesapces and struct nsproxy

Container admin can create new namespaces and force kernel to allocate up
to several pages of memory for the namespaces and its associated
structures.

Net and uts namespaces have enabled accounting for such allocations. It
makes sense to account for rest ones to restrict the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bdec0145 11-Aug-2021 Arnd Bergmann <arnd@arndb.de>

ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation

sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To
remove this one, expose the internal code of the actual implementation
that operates on a kernel pointer and call it directly after copying.

There should be no measurable impact on the normal execution of this
function, and it makes the overly long function a little shorter, which
may help readability.

While reworking the oabi version, make it behave a little more like
the native one, using kvmalloc_array() and restructure the code
flow in a similar way.

The naming of __do_semtimedop() is not very good, I hope someone can
come up with a better name.

One regression was spotted by kernel test robot <rong.a.chen@intel.com>
and fixed before the first mailing list submission.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

71bd9341 02-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"190 patches.

Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...


b869d5be 30-Jun-2021 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: use binary search for max_idx

If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used index in the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace. (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)

There is a cache for this value. But when the cache needs up be updated,
then the highest used index is determined by looping over all possible
values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.

With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around factor 13000.

As there is no get_last() function for idr structures: Implement a
"get_last()" using a binary search.

As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.

[akpm@linux-foundation.org: tweak comment, fix typo]

Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

17d056e0 30-Jun-2021 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock

The patch solves three weaknesses in ipc/sem.c:

1) The initial read of use_global_lock in sem_lock() is an intentional
race. KCSAN detects these accesses and prints a warning.

2) The code assumes that plain C read/writes are not mangled by the CPU
or the compiler.

3) The comment it sysvipc_sem_proc_show() was hard to understand: The
rest of the comments in ipc/sem.c speaks about sem_perm.lock, and
suddenly this function speaks about ipc_lock_object().

To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used
in code that owns sma->sem_perm.lock.

The comment is updated to solve 3)

[manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock]
Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com

Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bc8136a5 30-Jun-2021 Vasily Averin <vvs@virtuozzo.com>

ipc: use kmalloc for msg_queue and shmid_kernel

msg_queue and shmid_kernel are quite small objects, no need to use
kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems."

Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
common function for several ipc objects. It had kvmalloc call inside().
Later, this function went away and was finally replaced by direct kvmalloc
call, and now we can use more suitable kmalloc/kfree for them.

Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fc37a3b8 30-Jun-2021 Vasily Averin <vvs@virtuozzo.com>

ipc sem: use kvmalloc for sem_undo allocation

Patch series "ipc: allocations cleanup", v2.

Some ipc objects use the wrong allocation functions: small objects can use
kmalloc(), and vice versa, potentially large objects can use kmalloc().

This patch (of 2):

Size of sem_undo can exceed one page and with the maximum possible nsems =
32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
avoid user-triggered disruptive actions like OOM killer in case of
high-order memory shortage.

User triggerable high order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.

Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com
Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c54b245d 28-Jun-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace rlimit handling update from Eric Biederman:
"This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users
to have stricter limits on the resources created within a user
namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
cred: add missing return error code when set_cred_ucounts() failed
ucounts: Silence warning in dec_rlimit_ucounts
ucounts: Set ucount_max to the largest positive value the type can hold
kselftests: Add test to check for rlimit changes in different user namespaces
Reimplement RLIMIT_MEMLOCK on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_NPROC on top of ucounts
Use atomic_t for ucounts reference counting
Add a reference to ucounts for each cred
Increase size of ucounts to atomic_long_t


a11ddb37 22-May-2021 Varad Gautam <varad.gautam@suse.com>

ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7497835f 06-May-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ipc/sem.c: spelling fix

s/purpuse/purpose/

Link: https://lkml.kernel.org/r/20210319221432.26631-1-unixbhaskar@gmail.com
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b1989a3d 06-May-2021 Bhaskar Chowdhury <unixbhaskar@gmail.com>

ipc/sem.c: mundane typo fixes

s/runtine/runtime/
s/AQUIRE/ACQUIRE/
s/seperately/separately/
s/wont/won\'t/
s/succesfull/successful/

Link: https://lkml.kernel.org/r/20210326022240.26375-1-unixbhaskar@gmail.com
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d7c9e99a 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_MEMLOCK on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

6e52a9f0 22-Apr-2021 Alexey Gladkov <legion@kernel.org>

Reimplement RLIMIT_MSGQUEUE on top of ucounts

The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

6521f891 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: prepare for idmapped mounts

The various vfs_*() helpers are called by filesystems or by the vfs
itself to perform core operations such as create, link, mkdir, mknod, rename,
rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
inode is accessed through an idmapped mount map it into the
mount's user namespace and pass it down. Afterwards the checks and
operations are identical to non-idmapped mounts. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

47291baa 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: make permission helpers idmapped mount aware

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

ac73e3dc 15-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

- a few random little subsystems

- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.

Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...


dd3b614f 14-Dec-2020 Dmitry Safonov <0x7f454c46@gmail.com>

vm_ops: rename .split() callback to .may_split()

Rename the callback to reflect that it's not called *on* or *after* split,
but rather some time before the splitting to check if it's possible.

Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9b4240b 14-Dec-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:

- Use put_nsproxy() instead of open-coding it switch_task_namespaces()

- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.

- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688fe ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.

Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"

* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()


fff1662c 04-Sep-2020 Tobias Klauser <tklauser@distanz.ch>

ipc: adjust proc_ipc_sem_dointvec definition to match prototype

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which
fixes the following sparse error/warning:

ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces)
ipc/ipc_sysctl.c:94:47: expected void *buffer
ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer
ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... )
ipc/ipc_sysctl.c:194:35: got int ( * )( ... )

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

137ec390 03-Aug-2020 Kirill Tkhai <ktkhai@virtuozzo.com>

ipc: Use generic ns_common::count

Switch over ipc namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.

This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.

It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.

Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/159644978697.604812.16592754423881032385.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

ce14489c 11-Aug-2020 Liao Pingfang <liao.pingfang@zte.com.cn>

ipc/shm.c: remove the superfluous break

Remove the superfuous break, as there is a 'return' before it.

Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1594724361-11525-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

00898e85 11-Aug-2020 Alexey Dobriyan <adobriyan@gmail.com>

ipc: uninline functions

Two functions are only called via function pointers, don't bother
inlining them.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

45e55300 07-Aug-2020 Peter Collingbourne <pcc@google.com>

mm: remove unnecessary wrapper function do_mmap_pgoff()

The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.

The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().

Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d8ed45c5 08-Jun-2020 Michel Lespinasse <michel@lespinasse.org>

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1eb26fa 07-Jun-2020 Giuseppe Scrivano <gscrivan@redhat.com>

ipc/namespace.c: use a work queue to free_ipc

the reason is to avoid a delay caused by the synchronize_rcu() call in
kern_umount() when the mqueue mount is freed.

the code:

#define _GNU_SOURCE
#include <sched.h>
#include <error.h>
#include <errno.h>
#include <stdlib.h>

int main()
{
int i;

for (i = 0; i < 1000; i++)
if (unshare(CLONE_NEWIPC) < 0)
error(EXIT_FAILURE, errno, "unshare");
}

goes from

Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.06
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05

to

Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4b78e201 07-Jun-2020 Jules Irenge <jbi.octave@gmail.com>

ipc/msg: add missing annotation for freeque()

Sparse reports a warning at freeque()

warning: context imbalance in freeque() - unexpected unlock

The root cause is the missing annotation at freeque()

Add the missing __releases(RCU) annotation
Add the missing __releases(&msq->q_perm) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lu Shuaibing <shuaibinglu@126.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20200403160505.2832-2-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cb8e59cc 03-Jun-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from David Miller:

1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.

2) Add GSO partial support to igc, from Sasha Neftin.

3) Several cleanups and improvements to r8169 from Heiner Kallweit.

4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.

5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.

6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

8) Add sriov and vf support to hinic, from Luo bin.

9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...


e7c93cbf 03-Jun-2020 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite
a while and the patches have in one form or another already existed
for about a year. But I wanted to wait to see how the general api
would be received and adopted.

This contains the changes to make it possible to use pidfds to attach
to the namespaces of a process, i.e. they can be passed as the first
argument to the setns() syscall.

When only a single namespace type is specified the semantics are
equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
equals setns(pidfd, CLONE_NEWNET).

However, when a pidfd is passed, multiple namespace flags can be
specified in the second setns() argument and setns() will attach the
caller to all the specified namespaces all at once or to none of them.

Specifying 0 is not valid together with a pidfd. Here are just two
obvious examples:

setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);

Allowing to also attach subsets of namespaces supports various
use-cases where callers setns to a subset of namespaces to retain
privilege, perform an action and then re-attach another subset of
namespaces.

Apart from significantly reducing the number of syscalls needed to
attach to all currently supported namespaces (eight "open+setns"
sequences vs just a single "setns()"), this also allows atomic setns
to a set of namespaces, i.e. either attaching to all namespaces
succeeds or we fail without having changed anything.

This is centered around a new internal struct nsset which holds all
information necessary for a task to switch to a new set of namespaces
atomically. Fwiw, with this change a pidfd becomes the only token
needed to interact with a container. I'm expecting this to be
picked-up by util-linux for nsenter rather soon.

Associated with this change is a shiny new test-suite dedicated to
setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/pidfd: add pidfd setns tests
nsproxy: attach to namespaces via pidfds
nsproxy: add struct nsset


da07f52d 15-May-2020 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>


5e698222 13-May-2020 Vasily Averin <vvs@virtuozzo.com>

ipc/util.c: sysvipc_find_ipc() incorrectly updates position index

Commit 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase
position index") is causing this bug (seen on 5.6.8):

# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages

# ipcmk -Q
Message queue id: 0
# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x82db8127 0 root 644 0 0

# ipcmk -Q
Message queue id: 1
# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x82db8127 0 root 644 0 0
0x76d1fb2a 1 root 644 0 0

# ipcrm -q 0
# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x76d1fb2a 1 root 644 0 0
0x76d1fb2a 1 root 644 0 0

# ipcmk -Q
Message queue id: 2
# ipcrm -q 2
# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x76d1fb2a 1 root 644 0 0
0x76d1fb2a 1 root 644 0 0

# ipcmk -Q
Message queue id: 3
# ipcrm -q 1
# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x7c982867 3 root 644 0 0
0x7c982867 3 root 644 0 0
0x7c982867 3 root 644 0 0
0x7c982867 3 root 644 0 0

Whenever an IPC item with a low id is deleted, the items with higher ids
are duplicated, as if filling a hole.

new_pos should jump through hole of unused ids, pos can be updated
inside "for" cycle.

Fixes: 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase position index")
Reported-by: Andreas Schwab <schwab@suse.de>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/4921fe9b-9385-a2b4-1dc4-1099be6d2e39@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f2a8d52e 05-May-2020 Christian Brauner <christian.brauner@ubuntu.com>

nsproxy: add struct nsset

Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.

The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

b5f20061 07-May-2020 Oleg Nesterov <oleg@redhat.com>

ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()

Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
longer works if the sender doesn't have rights to send a signal.

Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
to avoid check_kill_permission().

This needs the additional notify.sigev_signo != 0 check, shouldn't we
change do_mq_notify() to deny sigev_signo == 0 ?

Test-case:

#include <signal.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/wait.h>
#include <assert.h>

static int notified;

static void sigh(int sig)
{
notified = 1;
}

int main(void)
{
signal(SIGIO, sigh);

int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
assert(fd >= 0);

struct sigevent se = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGIO,
};
assert(mq_notify(fd, &se) == 0);

if (!fork()) {
assert(setuid(1) == 0);
mq_send(fd, "",1,0);
return 0;
}

wait(NULL);
mq_unlink("/mq");
assert(notified);
return 0;
}

[manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Reported-by: Yoji <yoji.fujihar.min@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

89163f93 10-Apr-2020 Vasily Averin <vvs@virtuozzo.com>

ipc/util.c: sysvipc_find_ipc() should increase position index

If seq_file .next function does not change position index, read after
some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1cd377ba 06-Apr-2020 Jason Yan <yanaijie@huawei.com>

ipc/shm.c: make compat_ksys_shmctl() static

Fix the following sparse warning:

ipc/shm.c:1335:6: warning: symbol 'compat_ksys_shmctl' was not declared.
Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200403063933.24785-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

43afe4d3 06-Apr-2020 Somala Swaraj <somalaswaraj@gmail.com>

ipc/mqueue.c: fix a brace coding style issue

Signed-off-by: somala swaraj <somalaswaraj@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200301135530.18340-1-somalaswaraj@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d919b33d 06-Apr-2020 Alexey Dobriyan <adobriyan@gmail.com>

proc: faster open/read/close with "permanent" files

Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...

Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps

More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.

This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".

N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
glue(n % NR_CPUS);

while (*(volatile int *)&xxx == 0) {
}

for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}

int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}

N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);

for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}

std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}

auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';

return 0;
}

P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.

[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

edf28f40 20-Feb-2020 Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>

Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"

This reverts commit a97955844807e327df11aa33869009d14d6b7de0.

Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
in exit_sem()") removes a lock that is needed. This leads to a process
looping infinitely in exit_sem() and can also lead to a crash. There is
a reproducer available in [1] and with the commit reverted the issue
does not reproduce anymore.

Using the reproducer found in [1] is fairly easy to reach a point where
one of the child processes is looping infinitely in exit_sem between
for(;;) and if (semid == -1) block, while it's trying to free its last
sem_undo structure which has already been freed by freeary().

Each sem_undo struct is on two lists: one per semaphore set (list_id)
and one per process (list_proc). The list_id list tracks undos by
semaphore set, and the list_proc by process.

Undo structures are removed either by freeary() or by exit_sem(). The
freeary function is invoked when the user invokes a syscall to remove a
semaphore set. During this operation freeary() traverses the list_id
associated with the semaphore set and removes the undo structures from
both the list_id and list_proc lists.

For this case, exit_sem() is called at process exit. Each process
contains a struct sem_undo_list (referred to as "ulp") which contains
the head for the list_proc list. When the process exits, exit_sem()
traverses this list to remove each sem_undo struct. As in freeary(),
whenever a sem_undo struct is removed from list_proc, it is also removed
from the list_id list.

Removing elements from list_id is safe for both exit_sem() and freeary()
due to sem_lock(). Removing elements from list_proc is not safe;
freeary() locks &un->ulp->lock when it performs
list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
lock usage in exit_sem()").

This can result in the following situation while executing the
reproducer [1] : Consider a child process in exit_sem() and the parent
in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).

- The list_proc for the child contains the last two undo structs A and
B (the rest have been removed either by exit_sem() or freeary()).

- The semid for A is 1 and semid for B is 2.

- exit_sem() removes A and at the same time freeary() removes B.

- Since A and B have different semid sem_lock() will acquire different
locks for each process and both can proceed.

The bug is that they remove A and B from the same list_proc at the same
time because only freeary() acquires the ulp lock. When exit_sem()
removes A it makes ulp->list_proc.next to point at B and at the same
time freeary() removes B setting B->semid=-1.

At the next iteration of for(;;) loop exit_sem() will try to remove B.

The only way to break from for(;;) is for (&un->list_proc ==
&ulp->list_proc) to be true which is not. Then exit_sem() will check if
B->semid=-1 which is and will continue looping in for(;;) until the
memory for B is reallocated and the value at B->semid is changed.

At that point, exit_sem() will crash attempting to unlink B from the
lists (this can be easily triggered by running the reproducer [1] a
second time).

To prove this scenario instrumentation was added to keep information
about each sem_undo (un) struct that is removed per process and per
semaphore set (sma).

CPU0 CPU1
[caller holds sem_lock(sma for A)] ...
freeary() exit_sem()
... ...
... sem_lock(sma for B)
spin_lock(A->ulp->lock) ...
list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc)

Undo structures A and B have different semid and sem_lock() operations
proceed. However they belong to the same list_proc list and they are
removed at the same time. This results into ulp->list_proc.next
pointing to the address of B which is already removed.

After reverting commit a97955844807 ("ipc,sem: remove uneeded
sem_undo_list lock usage in exit_sem()") the issue was no longer
reproducible.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779

Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
Signed-off-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Herton R. Krzesinski <herton@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <malat@debian.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

97a32539 03-Feb-2020 Alexey Dobriyan <adobriyan@gmail.com>

proc: convert everything to "struct proc_ops"

The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

llseek => proc_lseek
unlocked_ioctl => proc_ioctl

xxx => proc_xxx

delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

889b3317 03-Feb-2020 Lu Shuaibing <shuaibinglu@126.com>

ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.

KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8116b54e 03-Feb-2020 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: document and update memory barriers

Document and update the memory barriers in ipc/sem.c:

- Add smp_store_release() to wake_up_sem_queue_prepare() and
document why it is needed.

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
regarding locking.

- Switch to using wake_q_add_safe().

Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0d97a82b 03-Feb-2020 Manfred Spraul <manfred@colorfullife.com>

ipc/msg.c: update and document memory barriers

Transfer findings from ipc/mqueue.c:

- A control barrier was missing for the lockless receive case So in
theory, not yet initialized data may have been copied to user space -
obviously only for architectures where control barriers are not NOP.

- use smp_store_release(). In theory, the refount may have been
decreased to 0 already when wake_q_add() tries to get a reference.

Link: http://lkml.kernel.org/r/20191020123305.14715-5-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c5b2cbdb 03-Feb-2020 Manfred Spraul <manfred@colorfullife.com>

ipc/mqueue.c: update/document memory barriers

Update and document memory barriers for mqueue.c:

- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_aquire__after_ctrl_dep() after the READ_ONCE, we need
acquire semantics if the value is STATE_READY.

- use wake_q_add_safe()

- document why __set_current_state() may be used:
Reading task->state cannot happen before the wake_q_add() call,
which happens while holding info->lock. Thus the spin_unlock()
is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3 CPU scenario, if the to be woken
up task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Waiman Long <longman@redhat.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ed29f171 03-Feb-2020 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue.c: remove duplicated code

pipelined_send() and pipelined_receive() are identical, so merge them.

[manfred@colorfullife.com: add changelog]
Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c593642c 09-Dec-2019 Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>

treewide: Use sizeof_field() macro

Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

if [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net

3ca47e95 23-Apr-2019 Arnd Bergmann <arnd@arndb.de>

y2038: remove CONFIG_64BIT_TIME

The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

984035ad 25-Sep-2019 Joel Fernandes (Google) <joel@joelfernandes.org>

ipc/sem.c: convert to use built-in RCU list checking

CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep
expression if using srcu or locking for protection. It can only check
regular RCU protection, all other protection needs to be passed as lockdep
expression.

Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Cc: Jonathan Derrick <jonathan.derrick@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c231740d 25-Sep-2019 Markus Elfring <elfring@users.sourceforge.net>

ipc/mqueue: improve exception handling in do_mq_notify()

Null pointers were assigned to local variables in a few cases as exception
handling. The jump target “out” was used where no meaningful data
processing actions should eventually be performed by branches of an if
statement then. Use an additional jump target for calling dev_kfree_skb()
directly.

Return also directly after error conditions were detected when no extra
clean-up is needed by this function implementation.

Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

97b0b1ad 25-Sep-2019 Markus Elfring <elfring@users.sourceforge.net>

ipc/mqueue.c: delete an unnecessary check before the macro call dev_kfree_skb()

dev_kfree_skb() input parameter validation, thus the test around the call
is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e170eb27 18-Sep-2019 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount API infrastructure updates from Al Viro:
"Infrastructure bits of mount API conversions.

The rest is more of per-filesystem updates and that will happen
in separate pull requests"

* 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mtd: Provide fs_context-aware mount_mtd() replacement
vfs: Create fs_context-aware mount_bdev() replacement
new helper: get_tree_keyed()
vfs: set fs_context::user_ns for reconfigure


fb377eb8 05-Sep-2019 Arnd Bergmann <arnd@arndb.de>

ipc: fix sparc64 ipc() wrapper

Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
to a commit from my y2038 series in linux-5.1, as I missed the custom
sys_ipc() wrapper that sparc64 uses in place of the generic version that
I patched.

The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
now do not allow being called with the IPC_64 flag any more, resulting
in a -EINVAL error when they don't recognize the command.

Instead, the correct way to do this now is to call the internal
ksys_old_{sem,shm,msg}ctl() functions to select the API version.

As we generally move towards these functions anyway, change all of
sparc_ipc() to consistently use those in place of the sys_*() versions,
and move the required ksys_*() declarations into linux/syscalls.h

The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
errors when ipc is disabled.

Reported-by: Matt Turner <mattst88@gmail.com>
Fixes: 275f22148e87 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
Cc: stable@vger.kernel.org
Tested-by: Matt Turner <mattst88@gmail.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

533770cc 03-Sep-2019 Al Viro <viro@zeniv.linux.org.uk>

new helper: get_tree_keyed()

For vfs_get_keyed_super users.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

933a90bf 19-Jul-2019 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount updates from Al Viro:
"The first part of mount updates.

Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...


eec4844f 18-Jul-2019 Matteo Croce <mcroce@redhat.com>

proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a318f12e 16-Jul-2019 Kees Cook <keescook@chromium.org>

ipc/mqueue.c: only perform resource calculation if user valid

Andreas Christoforou reported:

UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771

Which could be triggered by:

struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};

if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");

mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".

The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.

Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Andreas Christoforou <andreaschristofo@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b886d83c 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

709a643d 12-May-2019 Al Viro <viro@zeniv.linux.org.uk>

mqueue: set ->user_ns before ->get_tree()

... so that we could lift the capability checks into ->get_tree()
caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8e8ccf43 20-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 52

Based on 1 normalized pattern(s):

this file is released under gnu general public licence version 2 or
at your option any later version see the file copying for more
details

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071857.941092988@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

99db46ea 14-May-2019 Manfred Spraul <manfred@colorfullife.com>

ipc: do cyclic id allocation for the ipc object.

For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.

To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.

To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements

Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.

- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.

For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.

Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.

2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.

3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.

Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.

1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%

Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).

V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).

[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3278a2c2 14-May-2019 Manfred Spraul <manfred@colorfullife.com>

ipc: conserve sequence numbers in ipcmni_extend mode

Rewrite, based on the patch from Waiman Long:

The mixing in of a sequence number into the IPC IDs is probably to avoid
ID reuse in userspace as much as possible. With ipcmni_extend mode, the
number of usable sequence numbers is greatly reduced leading to higher
chance of ID reuse.

To address this issue, we need to conserve the sequence number space as
much as possible. Right now, the sequence number is incremented for
every new ID created. In reality, we only need to increment the
sequence number when new allocated ID is not greater than the last one
allocated. It is in such case that the new ID may collide with an
existing one. This is being done irrespective of the ipcmni mode.

In order to avoid any races, the index is first allocated and then the
pointer is replaced.

Changes compared to the initial patch:
- Handle failures from idr_alloc().
- Avoid that concurrent operations can see the wrong sequence number.
(This is achieved by using idr_replace()).
- IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
ipcmni_seq_shift().
- IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5ac893b8 14-May-2019 Waiman Long <longman@redhat.com>

ipc: allow boot time extension of IPCMNI from 32k to 16M

The maximum number of unique System V IPC identifiers was limited to
32k. That limit should be big enough for most use cases.

However, there are some users out there requesting for more, especially
those that are migrating from Solaris which uses 24 bits for unique
identifiers. To satisfy the need of those users, a new boot time kernel
option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This
is a 512X increase which should be big enough for users out there that
need a large number of unique IPC identifier.

The use of this new option will change the pattern of the IPC
identifiers returned by functions like shmget(2). An application that
depends on such pattern may not work properly. So it should only be
used if the users really need more than 32k of unique IPC numbers.

This new option does have the side effect of reducing the maximum number
of unique sequence numbers from 64k down to 128. So it is a trade-off.

The computation of a new IPC id is not done in the performance critical
path. So a little bit of additional overhead shouldn't have any real
performance impact.

Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a5091fda 14-May-2019 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: optimize msg_get()

Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
improve performance of send/recv"). However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course). By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is specially nice
for the rt cases. Furthermore, some callers can call msg_get() in a
loop.

A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game. Passes ltp mq testcases.

Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ecb5821 14-May-2019 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: remove redundant wq task assignment

We already store the current task fo the new waiter before calling
wq_sleep() in both send and recv paths. Trivially remove the redundant
assignment.

Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d6a2946a 14-May-2019 Li Rongqing <lirongqing@baidu.com>

ipc: prevent lockup on alloc_msg and free_msg

msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.

After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
msgctl10 R running task 21608 32505 2794 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:__is_insn_slot_addr+0xfb/0x250
Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
kernel_text_address+0xc1/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
create_object+0x380/0x650
__kmalloc+0x14c/0x2b0
load_msg+0x38/0x1a0
do_msgsnd+0x19e/0xcf0
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
msgctl10 R running task 21608 32170 32155 0x00000082
Call Trace:
preempt_schedule_irq+0x4c/0xb0
retint_kernel+0x1b/0x2d
RIP: 0010:lock_acquire+0x4d/0x340
Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
is_bpf_text_address+0x32/0xe0
kernel_text_address+0xec/0x100
__kernel_text_address+0xe/0x30
unwind_get_return_address+0x2f/0x50
__save_stack_trace+0x92/0x100
save_stack+0x32/0xb0
__kasan_slab_free+0x130/0x180
kfree+0xfa/0x2d0
free_msg+0x24/0x50
do_msgrcv+0x508/0xe60
do_syscall_64+0x117/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Davidlohr said:
"So after releasing the lock, the msg rbtree/list is empty and new
calls will not see those in the newly populated tmp_msg list, and
therefore they cannot access the delayed msg freeing pointers, which
is good. Also the fact that the node_cache is now freed before the
actual messages seems to be harmless as this is wanted for
msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
info->lock the thing is freed anyway so it should not change things"

Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

80f23212 07-May-2019 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.

3) Make flow classifier more lockless, from Vlad Buslov.

4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.

5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.

6) Partial GSO offload support in XFRM, from Boris Pismenny.

7) Add fast link down support to ethtool, from Heiner Kallweit.

8) Use siphash for IP ID generator, from Eric Dumazet.

9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.

10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.

11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.

12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.

13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.

15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.

16) Remove SKB list implementation assumptions in SCTP, your's truly.

17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.

18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.

20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.

21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.

22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.

23) Allow DSA tag drivers to be modular, from Andrew Lunn.

24) Remove legacy DSA probing support, also from Andrew Lunn.

25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.

26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.

27) More indirect call optimizations, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...


015d7956 15-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to ->free_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8f0db018 01-Apr-2019 NeilBrown <neilb@suse.com>

rhashtable: use bit_spin_locks to protect hash bucket.

This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".

Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7b47a9e7 12-Mar-2019 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount infrastructure updates from Al Viro:
"The rest of core infrastructure; no new syscalls in that pile, but the
old parts are switched to new infrastructure. At that point
conversions of individual filesystems can happen independently; some
are done here (afs, cgroup, procfs, etc.), there's also a large series
outside of that pile dealing with NFS (quite a bit of option-parsing
stuff is getting used there - it's one of the most convoluted
filesystems in terms of mount-related logics), but NFS bits are the
next cycle fodder.

It got seriously simplified since the last cycle; documentation is
probably the weakest bit at the moment - I considered dropping the
commit introducing Documentation/filesystems/mount_api.txt (cutting
the size increase by quarter ;-), but decided that it would be better
to fix it up after -rc1 instead.

That pile allows to do followup work in independent branches, which
should make life much easier for the next cycle. fs/super.c size
increase is unpleasant; there's a followup series that allows to
shrink it considerably, but I decided to leave that until the next
cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
afs: Use fs_context to pass parameters over automount
afs: Add fs_context support
vfs: Add some logging to the core users of the fs_context log
vfs: Implement logging through fs_context
vfs: Provide documentation for new mount API
vfs: Remove kern_mount_data()
hugetlbfs: Convert to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
cgroup: store a reference to cgroup_ns into cgroup_fs_context
cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
cgroup_do_mount(): massage calling conventions
cgroup: stash cgroup_root reference into cgroup_fs_context
cgroup2: switch to option-by-option parsing
cgroup1: switch to option-by-option parsing
cgroup: take options parsing into ->parse_monolithic()
cgroup: fold cgroup1_mount() into cgroup1_get_tree()
cgroup: start switching to fs_context
ipc: Convert mqueue fs to fs_context
proc: Add fs_context support to procfs
...


4a2ae929 07-Mar-2019 Gustavo A. R. Silva <gustavo@embeddedor.com>

ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size

Use kvzalloc() instead of kvmalloc() and memset().

Also, make use of the struct_size() helper instead of the open-coded
version in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20190131214221.GA28930@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

667da6a2 07-Mar-2019 Mathieu Malaterre <malat@debian.org>

ipc: annotate implicit fall through

There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).

This commit remove the following warning:

ipc/sem.c:1683:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203608.18218-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

935c6912 01-Nov-2018 David Howells <dhowells@redhat.com>

ipc: Convert mqueue fs to fs_context

Convert the mqueue filesystem to use the filesystem context stuff.

Notes:

(1) The relevant ipc namespace is selected in when the context is
initialised (and it defaults to the current task's ipc namespace).
The caller can override this before calling vfs_get_tree().

(2) Rather than simply calling kern_mount_data(), mq_init_ns() and
mq_internal_mount() create a context, adjust it and then do the rest
of the mount procedure.

(3) The lazy mqueue mounting on creation of a new namespace is retained
from a previous patch, but the avoidance of sget() if no superblock
yet exists is reverted and the superblock is again keyed on the
namespace pointer.

Yes, there was a performance gain in not searching the superblock
hash, but it's only paid once per ipc namespace - and only if someone
uses mqueue within that namespace, so I'm not sure it's worth it,
especially as calling sget() allows avoidance of recursion.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8dabe724 06-Jan-2019 Arnd Bergmann <arnd@arndb.de>

y2038: syscalls: rename y2038 compat syscalls

A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

275f2214 31-Dec-2018 Arnd Bergmann <arnd@arndb.de>

ipc: rename old-style shmctl/semctl/msgctl syscalls

The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.

For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.

The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.

As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.

A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

58fa4a41 16-Jan-2019 Arnd Bergmann <arnd@arndb.de>

ipc: introduce ksys_ipc()/compat_ksys_ipc() for s390

The sys_ipc() and compat_ksys_ipc() functions are meant to only
be used from the system call table, not called by another function.

Introduce ksys_*() interfaces for this purpose, as we have done
for many other system calls.

Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

8c81ddd2 30-Oct-2018 Waiman Long <longman@redhat.com>

ipc: IPCMNI limit check for semmni

For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni and
shmmni, we can't directly use the _minmax handler. Instead, a special sem
specific handler is added to check the last argument to make sure that it
is limited to the [0, IPCMNI] range. An error will be returned if this is
not the case.

Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6730e658 30-Oct-2018 Waiman Long <longman@redhat.com>

ipc: IPCMNI limit check for msgmni and shmmni

Patch series "ipc: IPCMNI limit check for *mni & increase that limit", v9.

The sysctl parameters msgmni, shmmni and semmni have an inherent limit of
IPC_MNI (32k). However, users may not be aware of that because they can
write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

The real IPCMNI limit is now enforced to make sure that users won't put in
an unrealistic value. The first 2 patches enforce the limits.

There are also users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 8M if the users
really want the extended value.

This patch (of 4):

A user can write arbitrary integer values to msgmni and shmmni sysctl
parameters without getting error, but the actual limit is really IPCMNI
(32k). This can mislead users as they think they can get a value that is
not real.

The right limits are now set for msgmni and shmmni so that the users will
become aware if they set a value outside of the acceptable range.

Link: http://lkml.kernel.org/r/1536352137-12003-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4dcb9239 25-Oct-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:

- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.

- An overhaul of the SHCMT clocksource driver

- SPDX license identifier updates

- Small cleanups and fixes all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...


ba9f6f89 24-Oct-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.

The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.

At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.

This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.

I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.

Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.

Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...


59cf0a93 05-Oct-2018 Kees Cook <keescook@chromium.org>

ipc/shm.c: use ERR_CAST() for shm_lock() error return

This uses ERR_CAST() instead of an open-coded cast, as it is casting
across structure pointers, which upsets __randomize_layout:

ipc/shm.c: In function `shm_lock':
ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'

return (void *)ipcp;
^~~~~~~~~~~~

Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
Fixes: 82061c57ce93 ("ipc: drop ipc_lock()")
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ae7795bc 25-Sep-2018 Eric W. Biederman <ebiederm@xmission.com>

signal: Distinguish between kernel_siginfo and siginfo

Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

9c21dae2 04-Sep-2018 Davidlohr Bueso <dbueso@suse.de>

ipc/shm: properly return EIDRM in shm_lock()

When getting rid of the general ipc_lock(), this was missed furthermore,
making the comment around the ipc object validity check bogus. Under
EIDRM conditions, callers will in turn not see the error and continue
with the operation.

Link: http://lkml.kernel.org/r/20180824030920.GD3677@linux-r8p5
Link: http://lkml.kernel.org/r/20180823024051.GC13343@shao2-debian
Fixes: 82061c57ce9 ("ipc: drop ipc_lock()")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9afc5eee 12-Jul-2018 Arnd Bergmann <arnd@arndb.de>

y2038: globally rename compat_time to old_time32

Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

2a9d6481 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: update return value of ipc_getref from int to bool

ipc_getref has still a return value of type "int", matching the atomic_t
interface of atomic_inc_not_zero()/atomic_add_unless().

ipc_getref now uses refcount_inc_not_zero, which has a return value of
type "bool".

Therefore, update the return code to avoid implicit conversions.

Link: http://lkml.kernel.org/r/20180712185241.4017-13-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

27c331a1 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: further variable name cleanups

The varable names got a mess, thus standardize them again:

id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree

Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eae04d25 21-Aug-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc: simplify ipc initialization

Now that we know that rhashtable_init() will not fail, we can get rid of a
lot of the unnecessary cleanup paths when the call errored out.

[manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning]
Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dc2c8c84 21-Aug-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc: get rid of ids->tables_initialized hack

In sysvipc we have an ids->tables_initialized regarding the rhashtable,
introduced in 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots
of keys")

It's there, specifically, to prevent nil pointer dereferences, from using
an uninitialized api. Considering how rhashtable_init() can fail
(probably due to ENOMEM, if anything), this made the overall ipc
initialization capable of failure as well. That alone is ugly, but fine,
however I've spotted a few issues regarding the semantics of
tables_initialized (however unlikely they may be):

- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc
structure isn't found, and can therefore call into ipcget() callbacks.

Now that rhashtable initialization cannot fail, we can properly get rid of
the hack altogether.

[manfred@colorfullife.com: commit id extended to 12 digits]
Link: http://lkml.kernel.org/r/20180712185241.4017-10-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

82061c57 21-Aug-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc: drop ipc_lock()

ipc/util.c contains multiple functions to get the ipc object pointer given
an id number.

There are two sets of function: One set verifies the sequence counter part
of the id number, other functions do not check the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence counter value,
but this is not obvious from the function name.

Furthermore, shm.c is the only user of this helper. Thus, we can simply
move the logic into shm_lock() and get rid of the function altogether.

[manfred@colorfullife.com: most of changelog]
Link: http://lkml.kernel.org/r/20180712185241.4017-7-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2e5ceb45 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: correct comment in ipc_obtain_object_check

The comment that explains ipc_obtain_object_check is wrong: The function
checks the sequence number, not the reference counter.

Note that checking the reference counter would be meaningless: The
reference counter is decreased without holding any locks, thus an object
with kern_ipc_perm.deleted=true may disappear at the end of the next rcu
grace period.

Link: http://lkml.kernel.org/r/20180712185241.4017-6-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4241c1a3 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc: rename ipcctl_pre_down_nolock()

Both the comment and the name of ipcctl_pre_down_nolock() are misleading:
The function must be called while holdling the rw semaphore.

Therefore the patch renames the function to ipcctl_obtain_check(): This
name matches the other names used in util.c:

- "obtain" function look up a pointer in the idr, without
acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

39cfffd7 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()

ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early, and by
modifying all callers.

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with reading kern_ipc_perm.seq, here both read and write to already
released memory could happen.

Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e2652ae6 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc: reorganize initialization of kern_ipc_perm.seq

ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc()
(within ipc_idr_alloc()).

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.

The patch moves the initialization of kern_ipc_perm.seq before the calls
of idr_alloc().

Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, is
remained unmodified.

There is no change of the behavior after a successful ..get() call: It
always clears .._next_id, there is no impact to non checkpoint/restore
code as that code does not use .._next_id.

2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fullfill that, i.e. more bugfixes are required.

The patch is a squash of a patch from Dmitry and my own changes.

Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com
Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

615c999c 21-Aug-2018 Manfred Spraul <manfred@colorfullife.com>

ipc: compute kern_ipc_perm.id under the ipc lock

ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use this
unitialized value as the return code.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with kern_ipc_perm.seq

Link: http://lkml.kernel.org/r/20180712185241.4017-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9a76aba0 15-Aug-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.

- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.

- Re-enable ASPM in r8169, from Kai-Heng Feng.

- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.

- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.

- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.

- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

- Lots of cleanups and fixes in L2TP code from Guillaume Nault.

- Add IPSEC offload support to netdevsim, from Shannon Nelson.

- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.

- Add UDP GSO support to mlx5e, from Boris Pismenny.

- Support offloading of Team LAG in NFP, from John Hurley.

- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.

- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.

- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.

- Busyloop optimizations in vhost_net, from Toshiaki Makita.

- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.

- Add parameter support to devlink, from Moshe Shemesh.

- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.

- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.

- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.

- Support regions and region snapshots in devlink, from Alex Vesker.

- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.

- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

- Use PHYLIB in r8169 driver, from Heiner Kallweit.

- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.

- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.

- Support for templates in packet scheduler classifier, from Jiri
Pirko.

- IPV6 support in RDS, from Ka-Cheong Poon.

- Native tproxy support in nf_tables, from Máté Eckl.

- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.

- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...


a66b4cd1 13-Aug-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs open-related updates from Al Viro:

- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.

- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.

- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).

* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...


c1c8626f 05-Aug-2018 David S. Miller <davem@davemloft.net>

Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net

Lots of overlapping changes, mostly trivial in nature.

The mlxsw conflict was resolving using the example
resolution at:

https://github.com/jpirko/linux_mlxsw/blob/combined_queue/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c

Signed-off-by: David S. Miller <davem@davemloft.net>


eec3636a 02-Aug-2018 Jane Chu <jane.chu@oracle.com>

ipc/shm.c add ->pagesize function to shm_vm_ops

Commit 05ea88608d4e ("mm, hugetlbfs: introduce ->pagesize() to
vm_operations_struct") adds a new ->pagesize() function to
hugetlb_vm_ops, intended to cover all hugetlbfs backed files.

With System V shared memory model, if "huge page" is specified, the
"shared memory" is backed by hugetlbfs files, but the mappings initiated
via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
so we need to add a ->pagesize function to shm_vm_ops. Otherwise,
vma_kernel_pagesize() returns PAGE_SIZE given a hugetlbfs backed vma,
result in below BUG:

fs/hugetlbfs/inode.c
443 if (unlikely(page_mapped(page))) {
444 BUG_ON(truncate_op);

resulting in

hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
------------[ cut here ]------------
kernel BUG at fs/hugetlbfs/inode.c:444!
Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
....
Call Trace:
hugetlbfs_evict_inode+0x1e/0x3e
evict+0xdb/0x1af
iput+0x1a2/0x1f7
dentry_unlink_inode+0xc6/0xf0
__dentry_kill+0xd8/0x18d
dput+0x1b5/0x1ed
__fput+0x18b/0x216
____fput+0xe/0x10
task_work_run+0x90/0xa7
exit_to_usermode_loop+0xdd/0x116
do_syscall_64+0x187/0x1ae
entry_SYSCALL_64_after_hwframe+0x150/0x0

[jane.chu@oracle.com: relocate comment]
Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
Fixes: 05ea88608d4e13 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

89b1698c 02-Aug-2018 David S. Miller <davem@davemloft.net>

Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net

The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>


f075faa3 26-Jul-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem.c: prevent queue.status tearing in semop

In order for load/store tearing prevention to work, _all_ accesses to
the variable in question need to be done around READ and WRITE_ONCE()
macros. Ensure everyone does so for q->status variable for
semtimedop().

Link: http://lkml.kernel.org/r/20180717052654.676-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4f089acc 16-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c9c554f2 11-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

alloc_file(): switch to passing O_... flags instead of FMODE_... mode

... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0eb71a9d 17-Jun-2018 NeilBrown <neilb@suse.com>

rhashtable: split rhashtable.h

Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code. rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14f28f57 14-Jun-2018 Souptick Joarder <jrdr.linux@gmail.com>

ipc: use new return type vm_fault_t

Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Commit 1c8f422059ae ("mm: change return type to vm_fault_t")

Link: http://lkml.kernel.org/r/20180425043413.GA21467@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ec67aaa4 14-Jun-2018 Davidlohr Bueso <dave@stgolabs.net>

sysvipc/sem: mitigate semnum index against spectre v1

Both smatch and coverity are reporting potential issues with spectre
variant 1 with the 'semnum' index within the sma->sems array, ie:

ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems'
ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems'
ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems'

Avoid any possible speculation by using array_index_nospec() thus
ensuring the semnum value is bounded to [0, sma->sem_nsems). With the
exception of sem_lock() all of these are slowpaths.

Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

344476e1 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kvmalloc() -> kvmalloc_array()

The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
patch replaces cases of:

kvmalloc(a * b, gfp)

with:
kvmalloc_array(a * b, gfp)

as well as handling cases of:

kvmalloc(a * b * c, gfp)

with:

kvmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kvmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kvmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kvmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kvmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvmalloc
+ kvmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kvmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kvmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kvmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kvmalloc(C1 * C2 * C3, ...)
|
kvmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kvmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kvmalloc(sizeof(THING) * C2, ...)
|
kvmalloc(sizeof(TYPE) * C2, ...)
|
kvmalloc(C1 * C2 * C3, ...)
|
kvmalloc(C1 * C2, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kvmalloc
+ kvmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>

ba252f16 04-Jun-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull time/Y2038 updates from Thomas Gleixner:

- Consolidate SySV IPC UAPI headers

- Convert SySV IPC to the new COMPAT_32BIT_TIME mechanism

- Cleanup the core interfaces and standardize on the ktime_get_* naming
convention.

- Convert the X86 platform ops to timespec64

- Remove the ugly temporary timespec64 hack

* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86: Convert x86_platform_ops to timespec64
timekeeping: Add more coarse clocktai/boottime interfaces
timekeeping: Add ktime_get_coarse_with_offset
timekeeping: Standardize on ktime_get_*() naming
timekeeping: Clean up ktime_get_real_ts64
timekeeping: Remove timespec64 hack
y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop
y2038: ipc: Enable COMPAT_32BIT_TIME
y2038: ipc: Use __kernel_timespec
y2038: ipc: Report long times to user space
y2038: ipc: Use ktime_get_real_seconds consistently
y2038: xtensa: Extend sysvipc data structures
y2038: powerpc: Extend sysvipc data structures
y2038: sparc: Extend sysvipc data structures
y2038: parisc: Extend sysvipc data structures
y2038: mips: Extend sysvipc data structures
y2038: arm64: Extend sysvipc compat data structures
y2038: s390: Remove unneeded ipc uapi header files
y2038: ia64: Remove unneeded ipc uapi header files
y2038: alpha: Remove unneeded ipc uapi header files
...


8f89c007 25-May-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/shm: fix shmat() nil address after round-down when remapping

shmat()'s SHM_REMAP option forbids passing a nil address for; this is in
fact the very first thing we check for. Andrea reported that for
SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check,
but we need to check again if the address was rounded down to nil. As
of this patch, such cases will return -EINVAL.

Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a73ab244 25-May-2018 Davidlohr Bueso <dave@stgolabs.net>

Revert "ipc/shm: Fix shmat mmap nil-page protection"

Patch series "ipc/shm: shmat() fixes around nil-page".

These patches fix two issues reported[1] a while back by Joe and Andrea
around how shmat(2) behaves with nil-page.

The first reverts a commit that it was incorrectly thought that mapping
nil-page (address=0) was a no no with MAP_FIXED. This is not the case,
with the exception of SHM_REMAP; which is address in the second patch.

I chose two patches because it is easier to backport and it explicitly
reverts bogus behaviour. Both patches ought to be in -stable and ltp
testcases need updated (the added testcase around the cve can be
modified to just test for SHM_RND|SHM_REMAP).

[1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805

This patch (of 2):

Commit 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
worked on the idea that we should not be mapping as root addr=0 and
MAP_FIXED. However, it was reported that this scenario is in fact
valid, thus making the patch both bogus and breaks userspace as well.

For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem
initialization[1].

[1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347
Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net
Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5dc0b152 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop

32-bit architectures implementing 64BIT_TIME and COMPAT_32BIT_TIME
need to have the traditional semtimedop() behavior with 32-bit timestamps
for sys_ipc() by calling compat_ksys_semtimedop(), while those that
are not yet converted need to keep using ksys_semtimedop() like
64-bit architectures do.

Note that I chose to not implement a new SEMTIMEDOP64 function that
corresponds to the new sys_semtimedop() with 64-bit timeouts. The reason
here is that sys_ipc() should no longer be used for new system calls,
and libc should just call the semtimedop syscall directly.

One open question remain to whether we want to completely avoid the
sys_ipc() system call for architectures that do not yet have all the
individual calls as they get converted to 64-bit time_t. Doing that
would require adding several extra system calls on m68k, mips, powerpc,
s390, sh, sparc, and x86-32.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

b0d17578 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Enable COMPAT_32BIT_TIME

Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop)
take a timespec argument. After we move 32-bit architectures over to
useing 64-bit time_t based syscalls, we need seperate entry points for
the old 32-bit based interfaces.

This changes the #ifdef guards for the existing 32-bit compat syscalls
to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
enabled on all existing 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

21fc538d 13-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Use __kernel_timespec

This is a preparatation for changing over __kernel_timespec to 64-bit
times, which involves assigning new system call numbers for mq_timedsend(),
mq_timedreceive() and semtimedop() for compatibility with future y2038
proof user space.

The existing ABIs will remain available through compat code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

c2ab975c 28-Apr-2015 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Report long times to user space

The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended
to contain extra fields for storing the upper bits of the time stamps,
this patch does the other half of the job and and fills the new fields on
32-bit architectures as well as 32-bit tasks running on a 64-bit kernel
in compat mode.

There should be no change for native 64-bit tasks.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

2a70b787 12-Apr-2018 Arnd Bergmann <arnd@arndb.de>

y2038: ipc: Use ktime_get_real_seconds consistently

In some places, we still used get_seconds() instead of
ktime_get_real_seconds(), and I'm changing the remaining ones now to
all use ktime_get_real_seconds() so we use the full available range for
timestamps instead of overflowing the 'unsigned long' return value in
year 2106 on 32-bit kernels.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

3f05317d 13-Apr-2018 Eric Biggers <ebiggers@google.com>

ipc/shm: fix use-after-free of shm file via remap_file_pages()

syzbot reported a use-after-free of shm_file_data(file)->file->f_op in
shm_get_unmapped_area(), called via sys_remap_file_pages().

Unfortunately it couldn't generate a reproducer, but I found a bug which
I think caused it. When remap_file_pages() is passed a full System V
shared memory segment, the memory is first unmapped, then a new map is
created using the ->vm_file. Between these steps, the shm ID can be
removed and reused for a new shm segment. But, shm_mmap() only checks
whether the ID is currently valid before calling the underlying file's
->mmap(); it doesn't check whether it was reused. Thus it can use the
wrong underlying file, one that was already freed.

Fix this by making the "outer" shm file (the one that gets put in
->vm_file) hold a reference to the real shm file, and by making
__shm_open() require that the file associated with the shm ID matches
the one associated with the "outer" file.

Taking the reference to the real shm file is needed to fully solve the
problem, since otherwise sfd->file could point to a freed file, which
then could be reallocated for the reused shm ID, causing the wrong shm
segment to be mapped (and without the required permission checks).

Commit 1ac0b6dec656 ("ipc/shm: handle removed segments gracefully in
shm_mmap()") almost fixed this bug, but it didn't go far enough because
it didn't consider the case where the shm ID is reused.

The following program usually reproduces this bug:

#include <stdlib.h>
#include <sys/shm.h>
#include <sys/syscall.h>
#include <unistd.h>

int main()
{
int is_parent = (fork() != 0);
srand(getpid());
for (;;) {
int id = shmget(0xF00F, 4096, IPC_CREAT|0700);
if (is_parent) {
void *addr = shmat(id, NULL, 0);
usleep(rand() % 50);
while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0));
} else {
usleep(rand() % 50);
shmctl(id, IPC_RMID, NULL);
}
}
}

It causes the following NULL pointer dereference due to a 'struct file'
being used while it's being freed. (I couldn't actually get a KASAN
use-after-free splat like in the syzbot report. But I think it's
possible with this bug; it would just take a more extraordinary race...)

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16a7c95 #189
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:d_inode include/linux/dcache.h:519 [inline]
RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724
[...]
Call Trace:
file_accessed include/linux/fs.h:2063 [inline]
shmem_mmap+0x25/0x40 mm/shmem.c:2149
call_mmap include/linux/fs.h:1789 [inline]
shm_mmap+0x34/0x80 ipc/shm.c:465
call_mmap include/linux/fs.h:1789 [inline]
mmap_region+0x309/0x5b0 mm/mmap.c:1712
do_mmap+0x294/0x4a0 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2235 [inline]
SYSC_remap_file_pages mm/mmap.c:2853 [inline]
SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769
do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ebiggers@google.com: add comment]
Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com
Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a61fc2cb 10-Apr-2018 Andrew Morton <akpm@linux-foundation.org>

ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops

This was added by the recent "ipc/shm.c: add split function to
shm_vm_ops", but it is not necessary.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

23c8cec8 10-Apr-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/msg: introduce msgctl(MSG_STAT_ANY)

There is a permission discrepancy when consulting msq ipc object
metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
command. The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the msq metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it. Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs). Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc
object permissions are ignored, and only audited instead. In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a280d6dc 10-Apr-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: introduce semctl(SEM_STAT_ANY)

There is a permission discrepancy when consulting shm ipc object
metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
command. The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the sma metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it. Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs). Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc
object permissions are ignored, and only audited instead. In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c21a6970 10-Apr-2018 Davidlohr Bueso <dave@stgolabs.net>

ipc/shm: introduce shmctl(SHM_STAT_ANY)

Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches adds the discussed (see [1]) new command for shm
as well as for sems and msq as they are subject to the same
discrepancies for ipc object permission checks between the syscall and
via procfs. These new commands are justified in that (1) we are stuck
with this semantics as changing syscall and procfs can break userland;
and (2) some users can benefit from performance (for large amounts of
shm segments, for example) from not having to parse the procfs
interface.

Once merged, I will submit the necesary manpage updates. But I'm thinking
something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
: .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
: .\" attaches to a segment that has already been marked for deletion.
: .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
: .\"
: .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
: .SH NAME
: @@ -242,6 +243,18 @@ However, the
: argument is not a segment identifier, but instead an index into
: the kernel's internal array that maintains information about
: all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembing the behaviour of
: +/proc/sysvipc/shm.
: .PP
: The caller can prevent or allow swapping of a shared
: memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
: kernel's internal array recording information about all
: shared memory segments.
: (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
: operations to obtain information about all shared memory segments
: on the system.)
: A successful
: @@ -328,7 +341,7 @@ isn't accessible.
: \fIshmid\fP is not a valid identifier, or \fIcmd\fP
: is not a valid command.
: Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
: operation, the index value specified in
: .I shmid
: referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata
between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command. The
later does permission checks for the object vs S_IRUGO. As such there can
be cases where EACCESS is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an overlook - so
we are stuck with it. Furthermore, modifying either the syscall or the
procfs file can cause userspace programs to break (ie ipcs). Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc
object permissions are ignored, and only audited instead. In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e74a0eff 10-Apr-2018 Alexey Dobriyan <adobriyan@gmail.com>

proc: move /proc/sysvipc creation to where it belongs

Move the proc_mkdir() call within the sysvipc subsystem such that we
avoid polluting proc_root_init() with petty cpp.

[dave@stgolabs.net: contributed changelog]
Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

17dec0a9 03-Apr-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull namespace updates from Eric Biederman:
"There was a lot of work this cycle fixing bugs that were discovered
after the merge window and getting everything ready where we can
reasonably support fully unprivileged fuse. The bug fixes you already
have and much of the unprivileged fuse work is coming in via other
trees.

Still left for fully unprivileged fuse is figuring out how to cleanly
handle .set_acl and .get_acl in the legacy case, and properly handling
of evm xattrs on unprivileged mounts.

Included in the tree is a cleanup from Alexely that replaced a linked
list with a statically allocated fix sized array for the pid caches,
which simplifies and speeds things up.

Then there is are some cleanups and fixes for the ipc namespace. The
motivation was that in reviewing other code it was discovered that
access ipc objects from different pid namespaces recorded pids in such
a way that when asked the wrong pids were returned. In the worst case
there has been a measured 30% performance impact for sysvipc
semaphores. Other test cases showed no measurable performance impact.
Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc
performance both gave the nod that this is good enough.

Casey Schaufler and James Morris have given their approval to the LSM
side of the changes.

I simplified the types and the code dealing with sysvipc to pass just
kern_ipc_perm for all three types of ipc. Which reduced the header
dependencies throughout the kernel and simplified the lsm code.

Which let me work on the pid fixes without having to worry about
trivial changes causing complete kernel recompiles"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ipc/shm: Fix pid freeing.
ipc/shm: fix up for struct file no longer being available in shm.h
ipc/smack: Tidy up from the change in type of the ipc security hooks
ipc: Directly call the security hook in ipc_ops.associate
ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces
ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces
ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.
ipc/util: Helpers for making the sysvipc operations pid namespace aware
ipc: Move IPCMNI from include/ipc.h into ipc/util.h
msg: Move struct msg_queue into ipc/msg.c
shm: Move struct shmid_kernel into ipc/shm.c
sem: Move struct sem and struct sem_array into ipc/sem.c
msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks
shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks
sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks
pidns: simpler allocation of pid_* caches


642e7fd2 02-Apr-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux

Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.

At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.

Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.

This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...


31c213f2 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add msgsnd syscall/compat_syscall wrappers

Provide ksys_msgsnd() and compat_ksys_msgsnd() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgsnd() and compat_sys_msgsnd().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

078faac9 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add msgrcv syscall/compat_syscall wrappers

Provide ksys_msgrcv() and compat_ksys_msgrcv() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgrcv() and compat_sys_msgrcv().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

e340db56 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add msgctl syscall/compat_syscall wrappers

Provide ksys_msgctl() and compat_ksys_msgctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgctl() and compat_sys_msgctl().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

c84d0791 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add shmctl syscall/compat_syscall wrappers

Provide ksys_shmctl() and compat_ksys_shmctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_shmctl() and compat_sys_shmctl().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

da1e2744 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add shmdt syscall wrapper

Provide ksys_shmdt() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_shmdt().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

65749e0b 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add shmget syscall wrapper

Provide ksys_shmget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_shmget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

3d65661a 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add msgget syscall wrapper

Provide ksys_msgget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_msgget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

d969c6fa 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semctl syscall/compat_syscall wrappers

Provide ksys_semctl() and compat_ksys_semctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_semctl() and compat_sys_semctl().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

69894718 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semget syscall wrapper

Provide ksys_semget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_semget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

41f4f0e2 20-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

ipc: add semtimedop syscall/compat_syscall wrappers

Provide ksys_semtimedop() and compat_ksys_semtimedop() wrappers to avoid
in-kernel calls to these syscalls. The ksys_ prefix denotes that these
functions are meant as a drop-in replacement for the syscalls. In
particular, they use the same calling convention as sys_semtimedop() and
compat_sys_semtimedop().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

3d942ee0 28-Mar-2018 Mike Kravetz <mike.kravetz@oracle.com>

ipc/shm.c: add split function to shm_vm_ops

If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and split
the underlying vma such that it is not huge page aligned. This will
untimately result in the following BUG:

kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E 4.15.0-10-generic #11-Ubuntu
NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 20040000
CFAR: c00000000036ee44 SOFTE: 1
NIP __unmap_hugepage_range+0xa4/0x760
LR __unmap_hugepage_range_final+0x28/0x50
Call Trace:
0x7115e4e00000 (unreliable)
__unmap_hugepage_range_final+0x28/0x50
unmap_single_vma+0x11c/0x190
unmap_vmas+0x94/0x140
exit_mmap+0x9c/0x1d0
mmput+0xa8/0x1d0
do_exit+0x360/0xc80
do_group_exit+0x60/0x100
SyS_exit_group+0x24/0x30
system_call+0x58/0x6c
---[ end trace ee88f958a1c62605 ]---

This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct"). A split function was
added to vm_operations_struct to determine if a mapping can be split.
This was mostly for device-dax and hugetlbfs mappings which have
specific alignment constraints.

Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops. shm_vm_ops functions will call back to the
original vm_ops if needed. Add such a split function to shm_vm_ops.

Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2236d4d3 28-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/shm: Fix pid freeing.

The 0day kernel test build report reported an oops:
>
> IP: put_pid+0x22/0x5c
> PGD 19efa067 P4D 19efa067 PUD 0
> Oops: 0000 [#1]
> CPU: 0 PID: 727 Comm: trinity Not tainted 4.16.0-rc2-00010-g98f929b #1
> RIP: 0010:put_pid+0x22/0x5c
> RSP: 0018:ffff986719f73e48 EFLAGS: 00010202
> RAX: 00000006d765f710 RBX: ffff98671a4fa4d0 RCX: ffff986719f73d40
> RDX: 000000006f6e6125 RSI: 0000000000000000 RDI: ffffffffa01e6d21
> RBP: ffffffffa0955fe0 R08: 0000000000000020 R09: 0000000000000000
> R10: 0000000000000078 R11: ffff986719f73e76 R12: 0000000000001000
> R13: 00000000ffffffea R14: 0000000054000fb0 R15: 0000000000000000
> FS: 00000000028c2880(0000) GS:ffffffffa06ad000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000677846439 CR3: 0000000019fc1005 CR4: 00000000000606b0
> Call Trace:
> ? ipc_update_pid+0x36/0x3e
> ? newseg+0x34c/0x3a6
> ? ipcget+0x5d/0x528
> ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
> ? SyS_shmget+0x5a/0x84
> ? do_syscall_64+0x194/0x1b3
> ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
> Code: ff 05 e7 20 9b 03 58 c9 c3 48 ff 05 85 21 9b 03 48 85 ff 74 4f 8b 47 04 8b 17 48 ff 05 7c 21 9b 03 48 83 c0 03 48 c1 e0 04 ff ca <48> 8b 44 07 08 74 1f 48 ff 05 6c 21 9b 03 ff 0f 0f 94 c2 48 ff
> RIP: put_pid+0x22/0x5c RSP: ffff986719f73e48
> CR2: 0000000677846439
> ---[ end trace ab8c5cb4389d37c5 ]---
> Kernel panic - not syncing: Fatal exception

In newseg when changing shm_cprid and shm_lprid from pid_t to struct
pid* I misread the kvmalloc as kvzalloc and thought shp was
initialized to 0. As that is not the case it is not safe to for the
error handling to address shm_cprid and shm_lprid before they are
initialized.

Therefore move the cleanup of shm_cprid and shm_lprid from the no_file
error cleanup path to the no_id error cleanup path. Ensuring that an
early error exit won't cause the oops above.

Reported-by: kernel test robot <fengguang.wu@intel.com>
Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

50ab44b1 23-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc: Directly call the security hook in ipc_ops.associate

After the last round of cleanups the shm, sem, and msg associate
operations just became trivial wrappers around the appropriate security
method. Simplify things further by just calling the security method
directly.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

51d6f263 23-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces

Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process. If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.

Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong. Using struct
pids from separate process are not a problem because they do not share
a cache line. Using struct pid from different threads of the same
process are unlikely to be a problem as the reference count update
can be avoided.

Further linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores. So I expect
programs that are performance limited by their interprocess mutual
exclusion primitive will be using futexes.

So while it is possible that enhancing the storage of the last
rocess of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count. I don't expect
that to happen in practice.

This change updates semctl(..., GETPID, ...) to return the
process id of the last process to update a semphore inthe
pid namespace of the calling process.

Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

39a4940e 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces

Today msg_lspid and msg_lrpid are remembered in the pid namespace of
the creator and the processes that last send or received a sysvipc
message. If you have processes in multiple pid namespaces that is
just wrong. The process ids reported will not make the least bit of
sense.

This fix is slightly more susceptible to a performance problem than
the related fix for System V shared memory. By definition the pids
are updated by msgsnd and msgrcv, the fast path of System V message
queues. The only concern over the previous implementation is the
incrementing and decrementing of the pid reference count. As that is
the only difference and multiple updates by of the task_tgid by
threads in the same process have been shown in af_unix sockets to
create a cache line ping-pong between cpus of the same processor.

In this case I don't expect cache lines holding pid reference counts
to ping pong between cpus. As senders and receivers update different
pids there is a natural separation there. Further if multiple threads
of the same process either send or receive messages the pid will be
updated to the same value and ipc_update_pid will avoid the reference
count update.

Which means in the common case I expect msg_lspid and msg_lrpid to
remain constant, and reference counts not to be updated when messages
are sent.

In rare cases it may be possible to trigger the issue which was
observed for af_unix sockets, but it will require multiple processes
with multiple threads to be either sending or receiving messages. It
just does not feel likely that anyone would do that in practice.

This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and
msg_lrpid in the pid namespace of the process calling stat.

This change also updates cat /proc/sysvipc/msg to return print msg_lspid
and msg_lrpid in the pid namespace of the process that opened the proc
file.

Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

98f929b1 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.

Today shm_cpid and shm_lpid are remembered in the pid namespace of the
creator and the processes that last touched a sysvipc shared memory
segment. If you have processes in multiple pid namespaces that
is just wrong, and I don't know how this has been over-looked for
so long.

As only creation and shared memory attach and shared memory detach
update the pids I do not expect there to be a repeat of the issues
when struct pid was attached to each af_unix skb, which in some
notable cases cut the performance in half. The problem was threads of
the same process updating same struct pid from different cpus causing
the cache line to be highly contended and bounce between cpus.

As creation, attach, and detach are expected to be rare operations for
sysvipc shared memory segments I do not expect that kind of cache line
ping pong to cause probems. In addition because the pid is at a fixed
location in the structure instead of being dynamic on a skb, the
reference count of the pid does not need to be updated on each
operation if the pid is the same. This ability to simply skip the pid
reference count changes if the pid is unchanging further reduces the
likelihood of the a cache line holding a pid reference count
ping-ponging between cpus.

Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user")
Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

cfb2f6f6 24-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

Revert "mqueue: switch to on-demand creation of internal mount"

This reverts commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21.

Aleksa Sarai <asarai@suse.de> writes:
> [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
>
> Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
> which was introduced by 36735a6a2b5e ("mqueue: switch to on-demand
> creation of internal mount").
>
> Basically, the reproducer boils down to being able to mount mqueue if
> you create a new user namespace, even if you don't unshare the IPC
> namespace.
>
> Previously this was not possible, and you would get an -EPERM. The mount
> is the *host* mqueue mount, which is being cached and just returned from
> mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
> it was intentional -- since I'm not familiar with mqueue).
>
> To me it looks like there is a missing permission check. I've included a
> patch below that I've compile-tested, and should block the above case.
> Can someone please tell me if I'm missing something? Is this actually
> safe?
>
> [1]: https://github.com/docker/docker/issues/36674

The issue is a lot deeper than a missing permission check. sb->s_user_ns
was is improperly set as well. So in addition to the filesystem being
mounted when it should not be mounted, so things are not allow that should
be.

We are practically to the release of 4.16 and there is no agreement between
Al Viro and myself on what the code should looks like to fix things properly.
So revert the code to what it was before so that we can take our time
and discuss this properly.

Fixes: 36735a6a2b5e ("mqueue: switch to on-demand creation of internal mount")
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Reported-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

03f1fc09 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc/util: Helpers for making the sysvipc operations pid namespace aware

Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
and /proc/sysvipc/sem are opened, and make it available through
the new helper ipc_seq_pid_ns.

This makes it possible to report the pids in these files in the
pid namespace of the opener of the files.

Implement ipc_update_pid. A simple impline helper that will only update
a struct pid pointer if the new value does not equal the old value. This
removes the need for wordy code sequences like:

old = object->pid;
object->pid = new;
put_pid(old);

and

old = object->pid;
if (old != new) {
object->pid = new;
put_pid(old);
}

Allowing the following to be written instead:

ipc_update_pid(&object->pid, new);

Which is easier to read and ensures that the pid reference count is
not touched the old and the new values are the same. Not touching
the reference count in this case is important to help avoid issues
like af_unix experienced, where multiple threads of the same
process managed to bounce the struct pid between cpu cache lines,
but updating the pids reference count.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

f83a396d 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

ipc: Move IPCMNI from include/ipc.h into ipc/util.h

The definition IPCMNI is only used in ipc/util.h and ipc/util.c. So
there is no reason to keep it in a header file that the whole kernel
can see. Move it into util.h to simplify future maintenance.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

34b56df9 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

msg: Move struct msg_queue into ipc/msg.c

All of the users are now in ipc/msg.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct msg_queue changes.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

a2e102cd 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

shm: Move struct shmid_kernel into ipc/shm.c

All of the users are now in ipc/shm.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct shmid_kernel changes.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

1a5c1349 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

sem: Move struct sem and struct sem_array into ipc/sem.c

All of the users are now in ipc/sem.c so make the definitions
local to that file to make code maintenance easier. AKA
to prevent rebuilding the entire kernel when one of these
files is changed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

d8c6e854 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks

All of the implementations of security hooks that take msg_queue only
access q_perm the struct kern_ipc_perm member. This means the
dependencies of the msg_queue security hooks can be simplified by
passing the kern_ipc_perm member of msg_queue.

Making this change will allow struct msg_queue to become private to
ipc/msg.c.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

7191adff 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks

All of the implementations of security hooks that take shmid_kernel only
access shm_perm the struct kern_ipc_perm member. This means the
dependencies of the shm security hooks can be simplified by passing
the kern_ipc_perm member of shmid_kernel..

Making this change will allow struct shmid_kernel to become private to ipc/shm.c.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

aefad959 22-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks

All of the implementations of security hooks that take sem_array only
access sem_perm the struct kern_ipc_perm member. This means the
dependencies of the sem security hooks can be simplified by passing
the kern_ipc_perm member of sem_array.

Making this change will allow struct sem and struct sem_array
to become private to ipc/sem.c.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

a9a08845 11-Feb-2018 Linus Torvalds <torvalds@linux-foundation.org>

vfs: do bulk POLL* -> EPOLL* replacement

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

68e34f4e 06-Feb-2018 Jonathan Haws <jhaws@sdl.usu.edu>

ipc/mqueue.c: have RT tasks queue in by priority in wq_add()

Previous behavior added tasks to the work queue using the static_prio
value instead of the dynamic priority value in prio. This caused RT tasks
to be added to the work queue in a FIFO manner rather than by priority.
Normal tasks were handled by priority.

This fix utilizes the dynamic priority of the task to ensure that both RT
and normal tasks are added to the work queue in priority order. Utilizing
the dynamic priority (prio) rather than the base priority (normal_prio)
was chosen to ensure that if a task had a boosted priority when it was
added to the work queue, it would be woken sooner to to ensure that it
releases any other locks it may be holding in a more timely manner. It is
understood that the task could have a lower priority when it wakes than
when it was added to the queue in this (unlikely) case.

Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu
Signed-off-by: Jonathan Haws <jhaws@sdl.usu.edu>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

87ad4b0d 06-Feb-2018 Philippe Mikoyan <philippe.mikoyan@skat.systems>

ipc: fix ipc data structures inconsistency

As described in the title, this patch fixes <ipc>id_ds inconsistency when
<ipc>ctl_stat executes concurrently with some ds-changing function, e.g.
shmat, msgsnd or whatever.

For instance, if shmctl(IPC_STAT) is running concurrently
with shmat, following data structure can be returned:
{... shm_lpid = 0, shm_nattch = 1, ...}

Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems
Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8b0fdf63 30-Jan-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull mqueue/bpf vfs cleanups from Al Viro:
"mqueue and bpf go through rather painful and similar contortions to
create objects in their dentry trees. Provide a primitive for doing
that without abusing ->mknod(), switch bpf and mqueue to it.

Another mqueue-related thing that has ended up in that branch is
on-demand creation of internal mount (based upon the work of Giuseppe
Scrivano)"

* 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: switch to on-demand creation of internal mount
tidy do_mq_open() up a bit
mqueue: clean prepare_open() up
do_mq_open(): move all work prior to dentry_open() into a helper
mqueue: fold mq_attr_ok() into mqueue_get_inode()
move dentry_open() calls up into do_mq_open()
mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
new primitive: vfs_mkobj()


168fe32a 30-Jan-2018 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.

Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.

Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.

The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.

As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...


faf1f22b 05-Jan-2018 Eric W. Biederman <ebiederm@xmission.com>

signal: Ensure generic siginfos the kernel sends have all bits initialized

Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

36735a6a 25-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to on-demand creation of internal mount

Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns. What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.

Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a713fd7f 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

tidy do_mq_open() up a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9b20d7fc 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: clean prepare_open() up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

066cc813 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

do_mq_open(): move all work prior to dentry_open() into a helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

05c1b290 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: fold mq_attr_ok() into mqueue_get_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

af4a5372 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

move dentry_open() calls up into do_mq_open()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

eecec19d 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9dd95748 02-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

ipc, kernel, mm: annotate ->poll() instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"

SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fa7f5780 17-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:

- a bit more MM

- procfs updates

- dynamic-debug fixes

- lib/ updates

- checkpatch

- epoll

- nilfs2

- signals

- rapidio

- PID management cleanup and optimization

- kcov updates

- sysvipc updates

- quite a few misc things all over the place

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
EXPERT Kconfig menu: fix broken EXPERT menu
include/asm-generic/topology.h: remove unused parent_node() macro
arch/tile/include/asm/topology.h: remove unused parent_node() macro
arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
arch/sh/include/asm/topology.h: remove unused parent_node() macro
arch/ia64/include/asm/topology.h: remove unused parent_node() macro
drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
mm: add infrastructure for get_user_pages_fast() benchmarking
sysvipc: make get_maxid O(1) again
sysvipc: properly name ipc_addid() limit parameter
sysvipc: duplicate lock comments wrt ipc_addid()
sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE
initramfs: use time64_t timestamps
drivers/watchdog: make use of devm_register_reboot_notifier()
kernel/reboot.c: add devm_register_reboot_notifier()
kcov: update documentation
Makefile: support flag -fsanitizer-coverage=trace-cmp
kcov: support comparison operands collection
kcov: remove pointless current != NULL check
kernel/panic.c: add TAINT_AUX
...


15df03c8 17-Nov-2017 Davidlohr Bueso <dave@stgolabs.net>

sysvipc: make get_maxid O(1) again

For a custom microbenchmark on a 3.30GHz Xeon SandyBridge, which calls
IPC_STAT over and over, it was calculated that, on avg the cost of
ipc_get_maxid() for increasing amounts of keys was:

10 keys: ~900 cycles
100 keys: ~15000 cycles
1000 keys: ~150000 cycles
10000 keys: ~2100000 cycles

This is unsurprising as maxid is currently O(n).

By having the max_id available in O(1) we save all those cycles for each
semctl(_STAT) command, the idr_find can be expensive -- which some real
(customer) workloads actually poll on.

Note that this used to be the case, until commit 7ca7e564e04 ("ipc:
store ipcs into IDRs"). The cost is the extra idr_find when doing
RMIDs, but we simply go backwards, and should not take too many
iterations to find the new value.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170831172049.14576-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ebf66799 17-Nov-2017 Davidlohr Bueso <dave@stgolabs.net>

sysvipc: properly name ipc_addid() limit parameter

This is better understood as a limit, instead of size; exactly like the
function comment indicates. Rename it.

Link: http://lkml.kernel.org/r/20170831172049.14576-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

39c96a1b 17-Nov-2017 Davidlohr Bueso <dave@stgolabs.net>

sysvipc: duplicate lock comments wrt ipc_addid()

The comment in msgqueues when using ipc_addid() is quite useful imo.
Duplicate it for shm and semaphores.

Link: http://lkml.kernel.org/r/20170831172049.14576-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b8fd9983 17-Nov-2017 Davidlohr Bueso <dave@stgolabs.net>

sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE

Patch series "sysvipc: ipc-key management improvements".

Here are a few improvements I spotted while eyeballing Guillaume's
rhashtable implementation for ipc keys. The first and fourth patches
are the interesting ones, the middle two are trivial.

This patch (of 4):

The next_id object-allocation functionality was introduced in commit
03f595668017 ("ipc: add sysctl to specify desired next object id").

Given that these new entries are _only_ exported under the
CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case
to even know about ->next_id. As such rewrite ipc_buildid() such that
it can do away with the field as well as unnecessary branches when
adding a new identifier. The end result also better differentiates both
cases, so the code ends up being cleaner; albeit the small duplications
regarding the default case.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170831172049.14576-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ca5b857c 17-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs updates from Al Viro:
"Assorted stuff, really no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: grab the lock instead of blocking in __fd_install during resizing
vfs: stop clearing close on exec when closing a fd
include/linux/fs.h: fix comment about struct address_space
fs: make fiemap work from compat_ioctl
coda: fix 'kernel memory exposure attempt' in fsync
pstore: remove unneeded unlikely()
vfs: remove unneeded unlikely()
stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
make vfs_ustat() static
do_handle_open() should be static
elf_fdpic: fix unused variable warning
fold destroy_super() into __put_super()
new helper: destroy_unused_super()
fix address space warnings in ipc/
acct.h: get rid of detritus


b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

6aa211e8 25-Sep-2017 Linus Torvalds <torvalds@linux-foundation.org>

fix address space warnings in ipc/

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b776e4b1 25-Sep-2017 Al Viro <viro@zeniv.linux.org.uk>

fix a typo in put_compat_shm_info()

"uip" misspelled as "up"; unfortunately, the latter happens to be
a function and gcc is happy to convert it to void *...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

58aff0af 18-Sep-2017 Will Deacon <will@kernel.org>

ipc/shm: Fix order of parameters when calling copy_compat_shmid_to_user

Commit 553f770ef71b ("ipc: move compat shmctl to native") moved the
compat IPC syscall handling into ipc/shm.c and refactored the struct
accessors in the process. Unfortunately, the call to
copy_compat_shmid_to_user when handling a compat {IPC,SHM}_STAT command
gets the arguments the wrong way round, passing a kernel stack address
as the user buffer (destination) and the user buffer as the kernel stack
address (source).

This patch fixes the parameter ordering so the buffers are accessed
correctly.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cc73fee0 14-Sep-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull ipc compat cleanup and 64-bit time_t from Al Viro:
"IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa
Dinamani"

* 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
utimes: Make utimes y2038 safe
ipc: shm: Make shmid_kernel timestamps y2038 safe
ipc: sem: Make sem_array timestamps y2038 safe
ipc: msg: Make msg_queue timestamps y2038 safe
ipc: mqueue: Replace timespec with timespec64
ipc: Make sys_semtimedop() y2038 safe
get rid of SYSVIPC_COMPAT on ia64
semtimedop(): move compat to native
shmat(2): move compat to native
msgrcv(2), msgsnd(2): move compat to native
ipc(2): move compat to native
ipc: make use of compat ipc_perm helpers
semctl(): move compat to native
semctl(): separate all layout-dependent copyin/copyout
msgctl(): move compat to native
msgctl(): split the actual work from copyin/copyout
ipc: move compat shmctl to native
shmctl: split the work from copyin/copyout


0cfb6aee 08-Sep-2017 Guillaume Knispel <guillaume.knispel@supersonicimagine.com>

ipc: optimize semget/shmget/msgget for lots of keys

ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).

This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:

for (int i = 0, k=0x424242; i < KEYS; ++i)
semget(k++, 1, IPC_CREAT | 0600);

total total max single max single
KEYS without with call without call with

1 3.5 4.9 µs 3.5 4.9
10 7.6 8.6 µs 3.7 4.7
32 16.2 15.9 µs 4.3 5.3
100 72.9 41.8 µs 3.7 4.7
1000 5,630.0 502.0 µs * *
10000 1,340,000.0 7,240.0 µs * *
31900 17,600,000.0 22,200.0 µs * *

*: unreliable measure: high variance

The duration for a lookup-only usage was obtained by the same loop once
the keys are present:

total total max single max single
KEYS without with call without call with

1 2.1 2.5 µs 2.1 2.5
10 4.5 4.8 µs 2.2 2.3
32 13.0 10.8 µs 2.3 2.8
100 82.9 25.1 µs * 2.3
1000 5,780.0 217.0 µs * *
10000 1,470,000.0 2,520.0 µs * *
31900 17,400,000.0 7,810.0 µs * *

Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:

creation:
total total
KEYS without with

1 3.7 5.0 µs
10 32.9 36.7 µs
32 125.0 109.0 µs
100 523.0 353.0 µs
1000 20,300.0 3,280.0 µs
10000 2,470,000.0 46,700.0 µs
31900 27,800,000.0 219,000.0 µs

lookup-only:
total total
KEYS without with

1 2.5 2.7 µs
10 25.4 24.4 µs
32 106.0 72.6 µs
100 591.0 352.0 µs
1000 22,400.0 2,250.0 µs
10000 2,510,000.0 25,700.0 µs
31900 28,200,000.0 115,000.0 µs

[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e4243b80 08-Sep-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: play nicer with large nsops allocations

Replacing semop()'s kmalloc for kvmalloc was originally proposed by
Manfred on the premise that it can be called for large (than order-1)
sizes. For example, while Oracle recommends setting SEMOPM to a _minimum_
of 100, some distros[1] encourage the setting to be a factor of the amount
of db tasks (PROCESSES), which can get fishy for large systems (easily
going beyond 1000).

[1] An Example of Semaphore Settings
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html

So let's just convert this to kvmalloc, just like the rest of the
allocations we do in ipc. While the fallback vmalloc obviously involves
more overhead, this by far the uncommon path, and it's better for the user
than just erroring out with kmalloc.

Link: http://lkml.kernel.org/r/20170803184136.13855-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8419e64a 08-Sep-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: drop sem_checkid helper

... 'tis not used.

Link: http://lkml.kernel.org/r/20170803184136.13855-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9405c03e 08-Sep-2017 Elena Reshetova <elena.reshetova@intel.com>

ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t

refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <arozansk@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f74370b8 08-Sep-2017 Elena Reshetova <elena.reshetova@intel.com>

ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t

refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/1499417992-3238-3-git-send-email-elena.reshetova@intel.com
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <arozansk@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a2e0602c 08-Sep-2017 Elena Reshetova <elena.reshetova@intel.com>

ipc: convert ipc_namespace.count from atomic_t to refcount_t

refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/1499417992-3238-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <arozansk@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7ff2819e 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: shm: Make shmid_kernel timestamps y2038 safe

time_t is not y2038 safe. Replace all uses of
time_t by y2038 safe time64_t.

Similarly, replace the calls to get_seconds() with
y2038 safe ktime_get_real_seconds().
Note that this preserves fast access on 64 bit systems,
but 32 bit systems need sequence counters.

The syscall interfaces themselves are not changed as part of
the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e54d02b2 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: sem: Make sem_array timestamps y2038 safe

time_t is not y2038 safe. Replace all uses of
time_t by y2038 safe time64_t.

Similarly, replace the calls to get_seconds() with
y2038 safe ktime_get_real_seconds().
Note that this preserves fast access on 64 bit systems,
but 32 bit systems need sequence counters.

The syscall interface themselves are not changed as part of
the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

50578ea9 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: msg: Make msg_queue timestamps y2038 safe

time_t is not y2038 safe. Replace all uses of
time_t by y2038 safe time64_t.

Similarly, replace the calls to get_seconds() with
y2038 safe ktime_get_real_seconds().
Note that this preserves fast access on 64 bit systems,
but 32 bit systems need sequence counters.

The syscall interfaces themselves are not changed as part of
the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b9047726 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: mqueue: Replace timespec with timespec64

struct timespec is not y2038 safe. Replace
all uses of timespec by y2038 safe struct timespec64.

Even though timespec is used here to represent timeouts,
replace these with timespec64 so that it facilitates
in verification by creating a y2038 safe kernel image
that is free of timespec.

The syscall interfaces themselves are not changed as part
of the patch. They will be part of a different series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3ef56dc2 02-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

ipc: Make sys_semtimedop() y2038 safe

struct timespec is not y2038 safe on 32 bit machines.
Replace timespec with y2038 safe struct timespec64.

Note that the patch only changes the internals without
modifying the syscall interface. This will be part
of a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

94edf6f3 21-Aug-2017 Ingo Molnar <mingo@kernel.org>

Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU updates from Paul E. McKenney:

- Removal of spin_unlock_wait()
- SRCU updates
- Torture-test updates
- Documentation updates
- Miscellaneous fixes
- CPU-hotplug fixes
- Miscellaneous non-RCU fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>


e0892e08 29-Jun-2017 Paul E. McKenney <paulmck@kernel.org>

ipc: Replace spin_unlock_wait() with lock/unlock pair

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
exit_sem() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because exit_sem()
is rarely invoked in production.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Manfred Spraul <manfred@colorfullife.com>

ade9f91b 02-Aug-2017 Kees Cook <keescook@chromium.org>

ipc: add missing container_of()s for randstruct

When building with the randstruct gcc plugin, the layout of the IPC
structs will be randomized, which requires any sub-structure accesses to
use container_of(). The proc display handlers were missing the needed
container_of()s since the iterator is passing in the top-level struct
kern_ipc_perm.

This would lead to crashes when running the "lsipc" program after the
system had IPC registered (e.g. after starting up Gnome):

general protection fault: 0000 [#1] PREEMPT SMP
...
RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0
...
Call Trace:
sysvipc_shm_proc_show+0x5e/0x150
sysvipc_proc_show+0x1a/0x30
seq_read+0x2e9/0x3f0
...

Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast
Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

44ee4546 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semtimedop(): move compat to native

... and finally kill the sodding compat_convert_timespec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a78ee9ed 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

shmat(2): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9b1404c2 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

msgrcv(2), msgsnd(2): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

20bc2a3a 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

ipc(2): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

28327fae 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

ipc: make use of compat ipc_perm helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c0ebccb6 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semctl(): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

45a4a64a 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

semctl(): separate all layout-dependent copyin/copyout

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

46939168 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

msgctl(): move compat to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

156d9ed1 09-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

msgctl(): split the actual work from copyin/copyout

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

553f770e 08-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

ipc: move compat shmctl to native

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9ba720c1 08-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

shmctl: split the work from copyin/copyout

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

62b49c99 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/util.h: update documentation for ipc_getref() and ipc_putref()

Now that ipc_rcu_alloc() and ipc_rcu_free() are removed, document when
it is valid to use ipc_getref() and ipc_putref().

Link: http://lkml.kernel.org/r/20170525185107.12869-21-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e2029dfe 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: drop __sem_free()

The remaining users of __sem_free() can simply call kvfree() instead for
better readability.

[manfred@colorfullife.com: Rediff to keep rcu protection for security_sem_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-20-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fb259c31 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/msg: remove special msg_alloc/free

There is nothing special about the msg_alloc/free routines any more, so
remove them to make code more readable.

[manfred@colorfullife.com: Rediff to keep rcu protection for security_msg_queue_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-19-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

42e618f7 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/shm: remove special shm_alloc/free

There is nothing special about the shm_alloc/free routines any more, so
remove them to make code more readable.

[manfred@colorfullife.com: Rediff, to continue to keep rcu for free calls after a successful security_shm_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-18-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3d3653f9 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc: move atomic_set() to where it is needed

Only after ipc_addid() has succeeded will refcounting be used, so move
initialization into ipc_addid() and remove from open-coded *_alloc()
routines.

Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

51c23b7b 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/msg.c: avoid ipc_rcu_putref for failed ipc_addid()

Loosely based on a patch from Kees Cook <keescook@chromium.org>:
- id and retval can be merged
- if ipc_addid() fails, then use call_rcu() directly.

The difference is that call_rcu is used for failed ipc_addid() calls, to
continue to guaranteed an rcu delay for security_msg_queue_free().

Link: http://lkml.kernel.org/r/20170525185107.12869-16-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a2642f87 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/shm.c: avoid ipc_rcu_putref for failed ipc_addid()

Loosely based on a patch from Kees Cook <keescook@chromium.org>:
- id and error can be merged
- if operations before ipc_addid() fail, then use call_rcu() directly.

The difference is that call_rcu is used for failures after
security_shm_alloc(), to continue to guaranteed an rcu delay for
security_sem_free().

Link: http://lkml.kernel.org/r/20170525185107.12869-15-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2ec55f80 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: avoid ipc_rcu_putref for failed ipc_addid()

Loosely based on a patch from Kees Cook <keescook@chromium.org>:
- id and retval can be merged
- if ipc_addid() fails, then use call_rcu() directly.

The difference is that call_rcu is used for failed ipc_addid() calls, to
continue to guaranteed an rcu delay for security_sem_free().

Link: http://lkml.kernel.org/r/20170525185107.12869-14-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c3f6fb6f 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/util: drop ipc_rcu_alloc()

No callers remain for ipc_rcu_alloc(). Drop the function.

[manfred@colorfullife.com: Rediff because the memset was temporarily inside ipc_rcu_free()]
Link: http://lkml.kernel.org/r/20170525185107.12869-13-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

52f90890 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/msg: avoid ipc_rcu_alloc()

Instead of using ipc_rcu_alloc() which only performs the refcount bump,
open code it. This also allows for msg_queue structure layout to be
randomized in the future.

Link: http://lkml.kernel.org/r/20170525185107.12869-12-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3e0c2404 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/shm: avoid ipc_rcu_alloc()

Instead of using ipc_rcu_alloc() which only performs the refcount bump,
open code it. This also allows for shmid_kernel structure layout to be
randomized in the future.

Link: http://lkml.kernel.org/r/20170525185107.12869-11-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

101ede01 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: avoid ipc_rcu_alloc()

Instead of using ipc_rcu_alloc() which only performs the refcount bump,
open code it to perform better sem-specific checks. This also allows
for sem_array structure layout to be randomized in the future.

[manfred@colorfullife.com: Rediff, because the memset was temporarily inside ipc_rcu_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-10-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5ccc8fb5 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/util: drop ipc_rcu_free()

There are no more callers of ipc_rcu_free(), so remove it.

Link: http://lkml.kernel.org/r/20170525185107.12869-9-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9ef5932f 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/msg: do not use ipc_rcu_free()

Avoid using ipc_rcu_free, since it just re-finds the original structure
pointer. For the pre-list-init failure path, there is no RCU needed,
since it was just allocated. It can be directly freed.

Link: http://lkml.kernel.org/r/20170525185107.12869-8-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

66470b18 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/shm: do not use ipc_rcu_free()

Avoid using ipc_rcu_free, since it just re-finds the original structure
pointer. For the pre-list-init failure path, there is no RCU needed,
since it was just allocated. It can be directly freed.

Link: http://lkml.kernel.org/r/20170525185107.12869-7-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1b4654ef 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc/sem: do not use ipc_rcu_free()

Avoid using ipc_rcu_free, since it just re-finds the original structure
pointer. For the pre-list-init failure path, there is no RCU needed,
since it was just allocated. It can be directly freed.

Link: http://lkml.kernel.org/r/20170525185107.12869-6-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f8dbe8d2 12-Jul-2017 Kees Cook <keescook@chromium.org>

ipc: drop non-RCU allocation

The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap
sem_io fall-back memory. Better to just open-code these to make things
easier to read.

[manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()]
Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dba4cdd3 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc: merge ipc_rcu and kern_ipc_perm

ipc has two management structures that exist for every id:
- struct kern_ipc_perm, it contains e.g. the permissions.
- struct ipc_rcu, it contains the rcu head for rcu handling and the
refcount.

The patch merges both structures.

As a bonus, we may save one cacheline, because both structures are
cacheline aligned. In addition, it reduces the number of casts, instead
most codepaths can use container_of.

To simplify code, the ipc_rcu_alloc initializes the allocation to 0.

[manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1a233956 12-Jul-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove sem_base, embed struct sem

sma->sem_base is initialized with

sma->sem_base = (struct sem *) &sma[1];

The current code has four problems:
- There is an unnecessary pointer dereference - sem_base is not needed.
- Alignment for struct sem only works by chance.
- The current code causes false positive for static code analysis.
- This is a cast between different non-void types, which the future
randstruct GCC plugin warns on.

And, as bonus, the code size gets smaller:

Before:
0 .text 00003770
After:
0 .text 0000374e

[manfred@colorfullife.com: s/[0]/[]/, per hch]
Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com
Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: <1vier1@web.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f991af3d 09-Jul-2017 Cong Wang <xiyou.wangcong@gmail.com>

mqueue: fix a use-after-free in sys_mq_notify()

The retry logic for netlink_attachskb() inside sys_mq_notify()
is nasty and vulnerable:

1) The sock refcnt is already released when retry is needed
2) The fd is controllable by user-space because we already
release the file refcnt

so we when retry but the fd has been just closed by user-space
during this small window, we end up calling netlink_detachskb()
on the error path which releases the sock again, later when
the user-space closes this socket a use-after-free could be
triggered.

Setting 'sock' to NULL here should be sufficient to fix it.

Reported-by: GeneBlue <geneblue.mail@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

33198c16 07-Jul-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling fixes from Jeff Layton:
"The main rationale for all of these changes is to tighten up writeback
error reporting to userland. There are many ways now that writeback
errors can be lost, such that fsync/fdatasync/msync return 0 when
writeback actually failed.

This pile contains a small set of cleanups and writeback error
handling fixes that I was able to break off from the main pile (#2).

Two of the patches in this pile are trivial. The exceptions are the
patch to fix up error handling in write_one_page, and the patch to
make JFS pay attention to write_one_page errors"

* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove call_fsync helper function
mm: clean up error handling in write_one_page
JFS: do not ignore return code from write_one_page()
mm: drop "wait" parameter from write_one_page()


0f41074a 05-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: remove call_fsync helper function

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

0d060606 27-Jun-2017 Al Viro <viro@zeniv.linux.org.uk>

mqueue: move compat syscalls to native ones

... and stop messing with compat_alloc_user_space() and friends

[braino fix from Colin King folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a7c3e901 08-May-2017 Michal Hocko <mhocko@suse.com>

mm: introduce kv[mz]alloc helpers

Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f0cb8802 08-May-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/shm: some shmat cleanups

Clean up early flag and address some minutia.

Link: http://lkml.kernel.org/r/1486673582-6979-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e579dde6 05-May-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.

There is a lot of good development going on but unfortunately it
didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root


2bcb9883 08-Aug-2016 Eric W. Biederman <ebiederm@xmission.com>

ipc: Remove unused declaration of recompute_msgmni

The function recompute_msgmni was removed a while ago
but it is still declared in a header file remove it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

0e056eb5 30-Mar-2017 Mauro Carvalho Chehab <mchehab@kernel.org>

kernel-api.rst: fix a series of errors when parsing C files

./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
./mm/filemap.c:1283: ERROR: Unexpected indentation.
./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
./ipc/util.c:676: ERROR: Unexpected indentation.
./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
./security/security.c:109: ERROR: Unexpected indentation.
./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
./ipc/util.c:477: ERROR: Unknown target name: "s".

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

1827adb1 03-Mar-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.

After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.

Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.

I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.

I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...


94e877d0 02-Mar-2017 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile two from Al Viro:

- orangefs fix

- series of fs/namei.c cleanups from me

- VFS stuff coming from overlayfs tree

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
orangefs: Use RCU for destroy_inode
vfs: use helper for calling f_op->fsync()
mm: use helper for calling f_op->mmap()
vfs: use helpers for calling f_op->{read,write}_iter()
vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
vfs: extract common parts of {compat_,}do_readv_writev()
vfs: wrap write f_ops with file_{start,end}_write()
vfs: deny copy_file_range() for non regular files
vfs: deny fallocate() on directory
vfs: create vfs helper vfs_tmpfile()
namei.c: split unlazy_walk()
namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
lookup_fast(): clean up the logics around the fallback to non-rcu mode
namei: fold unlazy_link() into its sole caller


653a7746 02-Mar-2017 Al Viro <viro@zeniv.linux.org.uk>

Merge remote-tracking branch 'ovl/for-viro' into for-linus

Overlayfs-related series from Miklos and Amir


eb61baf6 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Move the wake-queue types and interfaces from sched.h into <linux/sched/wake_q.h>

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

f719ff9b 06-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>

But first update the code that uses these facilities with the
new header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5b825c3a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

8703e8a4 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>

We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

84f001e1 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>

We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

95e91b83 27-Feb-2017 Davidlohr Bueso <dave@stgolabs.net>

ipc/shm: Fix shmat mmap nil-page protection

The issue is described here, with a nice testcase:

https://bugzilla.kernel.org/show_bug.cgi?id=192931

The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and
the address rounded down to 0. For the regular mmap case, the
protection mentioned above is that the kernel gets to generate the
address -- arch_get_unmapped_area() will always check for MAP_FIXED and
return that address. So by the time we do security_mmap_addr(0) things
get funky for shmat().

The testcase itself shows that while a regular user crashes, root will
not have a problem attaching a nil-page. There are two possible fixes
to this. The first, and which this patch does, is to simply allow root
to crash as well -- this is also regular mmap behavior, ie when hacking
up the testcase and adding mmap(... |MAP_FIXED). While this approach
is the safer option, the second alternative is to ignore SHM_RND if the
rounded address is 0, thus only having MAP_SHARED flags. This makes the
behavior of shmat() identical to the mmap() case. The downside of this
is obviously user visible, but does make sense in that it maintains
semantics after the round-down wrt 0 address and mmap.

Passes shm related ltp tests.

Link: http://lkml.kernel.org/r/1486050195-18629-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Gareth Evans <gareth.evans@contextis.co.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eac0b1c3 27-Feb-2017 Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

ipc/mqueue: add missing sparse annotation

Link: http://lkml.kernel.org/r/20170128235704.45302-1-luc.vanoostenryck@gmail.com
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9de5ab8a 27-Feb-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem: add hysteresis

sysv sem has two lock modes: One with per-semaphore locks, one lock mode
with a single global lock for the whole array. When switching from the
per-semaphore locks to the global lock, all per-semaphore locks must be
scanned for ongoing operations.

The patch adds a hysteresis for switching from the global lock to the
per semaphore locks. This reduces how often the per-semaphore locks
must be scanned.

Compared to the initial patch, this is a simplified solution: Setting
USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

In theory, a workload with exactly 10 simple sops and then one complex
op now scales a bit worse, but this is pure theory: If there is
concurrency, the it won't be exactly 10:1:10:1:10:1:... If there is no
concurrency, then there is no need for scalability.

Link: http://lkml.kernel.org/r/1476851896-3590-3-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: <1vier1@web.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: <felixh@informatik.uni-bremen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

27d7be18 27-Feb-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: avoid using spin_unlock_wait()

a) The ACQUIRE in spin_lock() applies to the read, not to the store, at
least for powerpc. This forces to add a smp_mb() into the fast path.

b) The memory barrier provided by spin_unlock_wait() is right now arch
dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single op semop calls(), observed +8.9% on x86. (the
other solution would be arch dependencies in ipc/sem).

Disadvantage: slower complex op semop calls, if (and only if) there are
no sleeping operations.

The next patch adds hysteresis, this further reduces the probability
that the slow path is used.

Link: http://lkml.kernel.org/r/1476851896-3590-2-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: <1vier1@web.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Cc: <felixh@informatik.uni-bremen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

897ab3e0 24-Feb-2017 Mike Rapoport <rppt@linux.vnet.ibm.com>

userfaultfd: non-cooperative: add event for memory unmaps

When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.

Since there might be different uffd contexts for the affected VMAs, we
first should create a temporary representation for the unmap event for
each uffd context and then notify them one by one to the appropriate
userfault file descriptors.

The event notification occurs after the mmap_sem has been released.

[arnd@arndb.de: fix nommu build]
Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
[mhocko@suse.com: fix nommu build]
Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

11bac800 24-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf

->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0eb8af49 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helper for calling f_op->fsync()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

f74ac015 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

mm: use helper for calling f_op->mmap()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

c626bc46 10-Jan-2017 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix incorrect sem_lock pairing

Based on the syzcaller test case from dvyukov:

https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt

The slow (i.e.: failure to acquire) syscall exit from semtimedop()
incorrectly assumed that the the same lock is acquired as it was at the
initial syscall entry.

This is wrong:
- thread A: single semop semop(), sleeps
- thread B: multi semop semop(), sleeps
- thread A: woken up by signal/timeout

With this sequence, the initial sem_lock() call locks the per-semaphore
spinlock, and it is unlocked with sem_unlock(). The call at the syscall
return locks the global spinlock. Because locknum is not updated, the
following sem_unlock() call unlocks the per-semaphore spinlock, which is
actually not locked.

The fix is trivial: Use the return value from sem_lock.

Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Johanna Abrahamsson <johanna@mjao.org>
Tested-by: Johanna Abrahamsson <johanna@mjao.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

370b262c 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: avoid idr tree lookup for interrupted semop

We can avoid the idr tree lookup (albeit possibly avoiding
idr_find_fast()) when being awoken in EINTR, as the semid will not
change in this context while blocked. Use the sma pointer directly and
take the sem_lock, then re-check for RMID races. We continue to
re-check the queue.status with the lock held such that we can detect
situations where we where are dealing with a spurious wakeup but another
task that holds the sem_lock updated the queue.status while we were
spinning for it. Once we take the lock it obviously won't change again.

Being the only caller, get rid of sem_obtain_lock() altogether.

Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b5fa01a2 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: simplify wait-wake loop

Instead of using the reverse goto, we can simplify the flow and make it
more language natural by just doing do-while instead. One would hope
this is the standard way (or obviously just with a while bucle) that we
do wait/wakeup handling in the kernel. The exact same logic is kept,
just more indented.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f150f02c 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: use proper list api for pending_list wakeups

... saves some LoC and looks cleaner than re-implementing the calls.

Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4663d3e8 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: explicitly inline check_restart

The compiler already does this, but make it explicit. This helper is
really small and also used in update_queue's main loop, which is O(N^2)
scanning. Inline and avoid the function overhead.

Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4ce33ec2 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: optimize perform_atomic_semop()

This is the main workhorse that deals with semop user calls such that
the waitforzero or semval update operations, on the set, can complete on
not as the sma currently stands. Currently, the set is iterated twice
(setting semval, then backwards for the sempid value). Slowpaths, and
particularly SEM_UNDO calls, must undo any altered sem when it is
detected that the caller must block or has errored-out.

With larger sets, there can occur situations where this involves a lot
of cycles and can obviously be a suboptimal use of cached resources in
shared memory. Ie, discarding CPU caches that are also calling semop
and have the sembuf cached (and can complete), while the current lock
holder doing the semop will block, error, or does a waitforzero
operation.

This patch proposes still iterating the set twice, but the first scan is
read-only, and we perform the actual updates afterward, once we know
that the call will succeed. In order to not suffer from the overhead of
dealing with sops that act on the same sem_num, such (rare) cases use
perform_atomic_semop_slow(), which is exactly what we have now.
Duplicates are detected before grabbing sem_lock, and uses simple a
32/64-bit hash array variable to based on the sem_num we are working on.

In addition add some comments to when we expect to the caller to block.

[akpm@linux-foundation.org: coding-style fixes]
[colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer]
Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com
Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9ae949fa 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: rework task wakeups

Our sysv sems have been using the notion of lockless wakeups for a
while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process
out of the spinlock section"), in order to reduce the sem_lock hold
times. This in-house pending queue can be replaced by wake_q (just like
all the rest of ipc now), in that it provides the following advantages:

o Simplifies and gets rid of unnecessary code.

o We get rid of the IN_WAKEUP complexities. Given that wake_q_add()
grabs reference to the task, if awoken due to an unrelated event,
between the wake_q_add() and wake_up_q() window, we cannot race with
sys_exit and the imminent call to wake_up_process().

o By not spinning IN_WAKEUP, we no longer need to disable preemption.

In consequence, the wakeup paths (after schedule(), that is) must
acknowledge an external signal/event, as well spurious wakeup occurring
during the pending wakeup window. Obviously no changes in semantics
that could be visible to the user. The fastpath is _only_ for when we
know for sure that we were awoken due to a the waker's successful semop
call (queue.status is not -EINTR).

On a 48-core Haswell, running the ipcscale 'waitforzero' test, the
following is seen with increasing thread counts:

v4.8-rc5 v4.8-rc5
semopv2
Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%)
Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%)
Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%)
Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%)
Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%)
Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%)
Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%)
Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%)
Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%)
Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%)
Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%)
Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%)
Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%)
Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%)
Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%)

[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename]
Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au
Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

248e7357 14-Dec-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: do not call wake_sem_queue_do() prematurely ... as this call should obviously be paired with its _prepare()

counterpart. At least whenever possible, as there is no harm in calling
it bogusly as we do now in a few places. Immediate error semop(2) paths
that are far from ever having the task block can be simplified and avoid
a few unnecessary loads on their way out of the call as it is not deeply
nested.

Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

63980c80 14-Dec-2016 Shailesh Pandey <p.shailesh@samsung.com>

ipc/shm.c: coding style fixes

This patch fixes below warnings:

WARNING: Missing a blank line after declarations
WARNING: Block comments use a trailing */ on a separate line
ERROR: spaces required around that '=' (ctx:WxV)

Above warnings were reported by checkpatch.pl

Link: http://lkml.kernel.org/r/1478604980-18062-1-git-send-email-p.shailesh@samsung.com
Signed-off-by: Shailesh Pandey <p.shailesh@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

99989835 14-Dec-2016 Jiri Slaby <jirislaby@kernel.org>

ipc: msg, make msgrcv work with LONG_MIN

When LONG_MIN is passed to msgrcv, one would expect to recieve any
message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is
undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN
again.

So handle this case properly by assigning LONG_MAX to *msgtyp if
LONG_MIN was specified as msgtyp to msgrcv.

This code:
long msg[] = { 100, 200 };
int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644);
msgsnd(m, &msg, sizeof(msg), 0);
msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0);

produces currently nothing:

msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538
msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(65538, ...

Except a UBSAN warning:

UBSAN: Undefined behaviour in ipc/msg.c:745:13
negation of -9223372036854775808 cannot be represented in type 'long int':

With the patch, I see what I expect:

msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0
msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16

Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

194a6b5b 17-Nov-2016 Waiman Long <longman@redhat.com>

sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q

Currently the wake_q data structure is defined by the WAKE_Q() macro.
This macro, however, looks like a function doing something as "wake" is
a verb. Even checkpatch.pl was confused as it reported warnings like

WARNING: Missing a blank line after declarations
#548: FILE: kernel/futex.c:3665:
+ int ret;
+ WAKE_Q(wake_q);

This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
what the macro is doing and eliminates the checkpatch.pl warnings.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

8c8d4d45 27-Oct-2016 Aristeu Rozanski <arozansk@redhat.com>

ipc: account for kmem usage on mqueue and msg

When kmem accounting switched from account by default to only account if
flagged by __GFP_ACCOUNT, IPC mqueue and messages was left out.

The production use case at hand is that mqueues should be customizable
via sysctls in Docker containers in a Kubernetes cluster. This can only
be safely allowed to the users of the cluster (without the risk that
they can cause resource shortage on a node, influencing other users'
containers) if all resources they control are bounded, i.e. accounted
for.

Link: http://lkml.kernel.org/r/1476806075-1210-1-git-send-email-arozansk@redhat.com
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Reported-by: Stefan Schimanski <sttts@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Schimanski <sttts@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2a1613a5 11-Oct-2016 Nikolay Borisov <kernel@kyup.com>

ipc/sem.c: add cond_resched in exit_sme

In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in
exit_sem. Apparently it's possible for the loop to take quite a long time
and it doesn't have a scheduling point in it. Since the codes is
executing under an rcu read section this may also cause rcu stalls, which
in turn block synchronize_rcu operations, which more or less de-stabilises
the whole system.

Fix this by introducing a cond_resched() at the beginning of the loop.

So this patch fixes the following:

NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
RIP: 0010:[<ffffffff81614bc7>] [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30
RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
Call Trace:
? exit_sem+0x7c/0x280
do_exit+0x338/0xb40
do_group_exit+0x43/0xd0
SyS_exit_group+0x14/0x20
entry_SYSCALL_64_fastpath+0x16/0x6e

Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ed27f912 11-Oct-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/msg: avoid waking sender upon full queue

Blocked tasks queued in q_senders waiting for their message to fit in the
queue are blindly awoken every time we think there's a remote chance this
might happen. This could cause numerous (and expensive -- thundering
herd-ish) bogus wakeups if the queue is still really full. Adding to the
scheduling cost/overhead, there's also the fact that we need to take the
ipc object lock and requeue ourselves in the q_senders list.

By keeping track of the blocked sender's message size, we can know
previously if the wakeup ought to occur or not. Otherwise, to maintain
the current wakeup order we just move it to the tail. This is exactly
what occurs right now if the sender needs to go back to sleep.

The case of EIDRM is left completely untouched, as we need to wakeup all
the tasks, and shouldn't be playing games in the first place.

This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context
switches (avg out of ten runs). Although these tests are really about
functionality (as opposed to performance), is does show the direct
benefits of the optimization.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d0d6a2a9 11-Oct-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/msg: make ss_wakeup() kill arg boolean

... 'tis annoying.

Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e3658538 11-Oct-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/msg: batch queue sender wakeups

Currently the use of wake_qs in sysv msg queues are only for the receiver
tasks that are blocked on the queue. But blocked sender tasks (due to
queue size constraints) still are awoken with the ipc object lock held,
which can be a problem particularly for small sized queues and far from
gracious for -rt (just like it was for the receiver side).

The paths that actually wakeup a sender are obviously related to when we
are either getting rid of the queue or after (some) space is freed-up
after a receiver takes the msg (msgrcv). Furthermore, with the exception
of msgrcv, we can always piggy-back on expunge_all that has its own tasks
lined-up for waking. Finally, upon unlinking the message, it should be no
problem delaying the wakeups a bit until after we've released the lock.

Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ee51636c 11-Oct-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

ipc/msg: implement lockless pipelined wakeups

This patch moves the wakeup_process() invocation so it is not done under
the ipc global lock by making use of a lockless wake_q. With this change,
the waiter is woken up once the message has been assigned and it does not
need to loop on SMP if the message points to NULL. In the signal case we
still need to check the pointer under the lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -RT
which avoids a busy-loop which pools for the NULL -> !NULL change if the
waiter has a higher priority compared to the waker.

By making use of wake_qs, the logic of sysv msg queues is greatly
simplified (and very well suited as we can batch lockless wakeups),
particularly around the lockless receive algorithm.

This has been tested with Manred's pmsg-shared tool on a "AMD A10-7800
Radeon R7, 12 Compute Cores 4C+8G":

test | before | after | diff
-----------------|------------|------------|----------
pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 %

Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5864a2fd 11-Oct-2016 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix complex_count vs. simple op race

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
Reported-by: <felixh@informatik.uni-bremen.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

101105b1 10-Oct-2016 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning


078cd827 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Replace CURRENT_TIME with current_time() for inode timestamps

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

78725596 22-Sep-2016 Eric W. Biederman <ebiederm@xmission.com>

Merge branch 'nsfs-ioctls' into HEAD

From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:

fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.

NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.

In addition to generic ioctl(2) errors, the following specific ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM The requested namespace is outside of the current namespace
scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>


bcac25a5 06-Sep-2016 Andrey Vagin <avagin@openvz.org>

kernel: add a helper to get an owning user namespace for a namespace

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

df75e774 22-Sep-2016 Eric W. Biederman <ebiederm@xmission.com>

userns: When the per user per user namespace limit is reached return ENOSPC

The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

aba35661 08-Aug-2016 Eric W. Biederman <ebiederm@xmission.com>

ipcns: Add a limit on the number of ipc namespaces

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

3bd080e4 02-Aug-2016 Alexey Dobriyan <adobriyan@gmail.com>

ipc: delete "nr_ipc_ns"

Write-only variable.

Link: http://lkml.kernel.org/r/20160708214356.GA6785@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9b24fef9 02-Aug-2016 Fabian Frederick <fabf@skynet.be>

sysv, ipc: fix security-layer leaking

Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.

Running LTP msgsnd06 with kmemleak gives the following:

cat /sys/kernel/debug/kmemleak

unreferenced object 0xffff88003c0a11f8 (size 8):
comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
hex dump (first 8 bytes):
1b 00 00 00 01 00 00 00 ........
backtrace:
kmemleak_alloc+0x23/0x40
kmem_cache_alloc_trace+0xe1/0x180
selinux_msg_queue_alloc_security+0x3f/0xd0
security_msg_queue_alloc+0x2e/0x40
newque+0x4e/0x150
ipcget+0x159/0x1b0
SyS_msgget+0x39/0x40
entry_SYSCALL_64_fastpath+0x13/0x8f

Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to
only use ipc_rcu_free in case of security allocation failure in newary()

Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a867d734 29-Jul-2016 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.

While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.

At a high level what this patchset does a couple of simple things:

- Add a user namespace owner (s_user_ns) to struct super_block.

- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.

By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.

One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).

This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.

These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.

- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.

- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.

Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.

There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.

- Verifying the mounter has permission to read/write the block device
during mount.

- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.

- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).

Furthermore there are a few changes that are on the wishlist:

- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]

- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.

- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.

I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.

Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...


4595ef88 26-Jul-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

shmem: make shmem_inode_info::lock irq-safe

We are going to need to call shmem_charge() under tree_lock to get
accoutning right on collapse of small tmpfs pages into a huge one.

The problem is that tree_lock is irq-safe and lockdep is not happy, that
we take irq-unsafe lock under irq-safe[1].

Let's convert the lock to irq-safe.

[1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc

Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c01d5b30 26-Jul-2016 Hugh Dickins <hughd@google.com>

shmem: get_unmapped_area align huge page

Provide a shmem_get_unmapped_area method in file_operations, called at
mmap time to decide the mapping address. It could be conditional on
CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases it
just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back to
ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(), not via
the file_operations: because of the way shmem_zero_setup() is called to
create a shmem object late in the mmap sequence, when MAP_SHARED is
requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
when /proc/sys/vm/shmem_huge has been set.

Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a2982cc9 09-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Generalize filesystem nodev handling.

Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

3ee69014 20-May-2016 Eric W. Biederman <ebiederm@xmission.com>

ipc/mqueue: The mqueue filesystem should never contain executables

Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
do not result in executable on mqueuefs by accident.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

d91ee87d 23-May-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Pass data, ns, and ns->userns to mount_ns

Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

b236017a 30-May-2016 Eric W. Biederman <ebiederm@xmission.com>

ipc: Initialize ipc_namespace->user_ns early.

Allow the ipc namespace initialization code to depend on ns->user_ns
being set during initialization.

In particular this allows mq_init_ns to use ns->user_ns for permission
checks and initializating s_user_ns while the the mq filesystem is
being mounted.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Suggested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

be3e7844 24-May-2016 Peter Zijlstra <peterz@infradead.org>

locking/spinlock: Update spin_unlock_wait() users

With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() usecase, as that was somewhat stale/obscure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

33ac2796 24-May-2016 Peter Zijlstra <peterz@infradead.org>

locking/barriers: Introduce smp_acquire__after_ctrl_dep()

Introduce smp_acquire__after_ctrl_dep(), this construct is not
uncommon, but the lack of this barrier is.

Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
semaphore code and the qspinlock code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

91f4f94e 23-May-2016 Michal Hocko <mhocko@suse.com>

ipc, shm: make shmem attach/detach wait for mmap_sem killable

shmat and shmdt rely on mmap_sem for write. If the waiting task gets
killed by the oom killer it would block oom_reaper from asynchronous
address space reclaim and reduce the chances of timely OOM resolving.
Wait for the lock in the killable mode and return with EINTR if the task
got killed while waiting.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a5f4db87 22-Mar-2016 Davidlohr Bueso <dave@stgolabs.net>

ipc/sem: make semctl setting sempid consistent

As indicated by bug#112271, Linux sets the sempid value upon semctl, and
not only for semop calls. However, within semctl we only do this for
SETVAL, leaving SETALL without updating the field, and therefore rather
inconsistent behavior when compared to other Unices.

There is really no documentation regarding this and therefore users
should not make assumptions. With this patch, along with updating
semctl.2 manpages, this scenario should become less ambiguous As such,
set sempid on SETALL cmd.

Also update some in-code documentation, specifying where the sempid is
set.

Passes ltp and custom testcase where a child (fork) does SETALL to the
set.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1ac0b6de 17-Feb-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

ipc/shm: handle removed segments gracefully in shm_mmap()

remap_file_pages(2) emulation can reach file which represents removed
IPC ID as long as a memory segment is mapped. It breaks expectations of
IPC subsystem.

Test case (rewritten to be more human readable, originally autogenerated
by syzkaller[1]):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

#define PAGE_SIZE 4096

int main()
{
int id;
void *p;

id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0);
p = shmat(id, NULL, 0);
shmctl(id, IPC_RMID, NULL);
remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0);

return 0;
}

The patch changes shm_mmap() and code around shm_lock() to propagate
locking error back to caller of shm_mmap().

[1] http://github.com/google/syzkaller

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cc673757 23-Jan-2016 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull final vfs updates from Al Viro:

- The ->i_mutex wrappers (with small prereq in lustre)

- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)

- followup to dedupe stuff merged this cycle

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration


1d5cfdb0 22-Jan-2016 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

tree wide: use kvfree() than conditional kfree()/vfree()

There are many locations that do

if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);

but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2954e440 20-Jan-2016 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

ipc/shm.c: is_file_shm_hugepages() can be boolean

Make is_file_shm_hugepages() return bool to improve readability due to
this particular function only using either one or zero as its return
value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5d097056 14-Jan-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

kmemcg: account certain kmem allocations to memcg

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5f2a2d5d 06-Nov-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,msg: drop dst nil validation in copy_msg

d0edd8528362 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
nil dst parameter check, originally being a full BUG_ON. However, this
check seems quite unnecessary when the only purpose is for
ceckpoint/restore (MSG_COPY flag):

o The copy variable is set initially to nil, apparently as a way of
ensuring that prepare_copy is previously called. Which is in fact done,
unconditionally at the beginning of do_msgrcv.

o There is no concurrency with 'copy' (stack allocated in do_msgrcv).

Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
always handled by IS_ERR() family. Therefore remove this check altogether
as it can never occur with the current users.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b9a53227 29-Sep-2015 Linus Torvalds <torvalds@linux-foundation.org>

Initialize msg/shm IPC objects before doing ipc_addid()

As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
having initialized the IPC object state. Yes, we initialize the IPC
object in a locked state, but with all the lockless RCU lookup work,
that IPC object lock no longer means that the state cannot be seen.

We already did this for the IPC semaphore code (see commit e8577d1f0329:
"ipc/sem.c: fully initialize sem_array before making it visible") but we
clearly forgot about msg and shm.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d0edd852 09-Sep-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc: convert invalid scenarios to use WARN_ON

Considering Linus' past rants about the (ab)use of BUG in the kernel, I
took a look at how we deal with such calls in ipc. Given that any errors
or corruption in ipc code are most likely contained within the set of
processes participating in the broken mechanisms, there aren't really many
strong fatal system failure scenarios that would require a BUG call.
Also, if something is seriously wrong, ipc might not be the place for such
a BUG either.

1. For example, recently, a customer hit one of these BUG_ONs in shm
after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
and WARN would have been better.

2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
I don't see how we can hit this anyway -- at least it should be IS_ERR.
The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
first and foremost. We could also probably drop this check altogether.
Either way, it does not merit a BUG_ON.

3. No ->fault() callback for the fs getting the corresponding page --
seems selfish to make the system unusable.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3ed1f8a9 14-Aug-2015 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update/correct memory barriers

sem_lock() did not properly pair memory barriers:

!spin_is_locked() and spin_unlock_wait() are both only control barriers.
The code needs an acquire barrier, otherwise the cpu might perform read
operations before the lock test.

As no primitive exists inside <include/spinlock.h> and since it seems
noone wants another primitive, the code creates a local primitive within
ipc/sem.c.

With regards to -stable:

The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
nop (just a preprocessor redefinition to improve the readability). The
bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
starting from 3.10).

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a9795584 14-Aug-2015 Herton R. Krzesinski <herton@redhat.com>

ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()

After we acquire the sma->sem_perm lock in exit_sem(), we are protected
against a racing IPC_RMID operation. Also at that point, we are the last
user of sem_undo_list. Therefore it isn't required that we acquire or use
ulp->lock.

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
CC: Aristeu Rozanski <aris@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

602b8593 14-Aug-2015 Herton R. Krzesinski <herton@redhat.com>

ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

The current semaphore code allows a potential use after free: in
exit_sem we may free the task's sem_undo_list while there is still
another task looping through the same semaphore set and cleaning the
sem_undo list at freeary function (the task called IPC_RMID for the same
semaphore set).

For example, with a test program [1] running which keeps forking a lot
of processes (which then do a semop call with SEM_UNDO flag), and with
the parent right after removing the semaphore set with IPC_RMID, and a
kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
in the kernel log:

Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
Prev obj: start=ffff88003b45c180, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
Next obj: start=ffff88003b45c200, len=64
000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<....
BUG: spinlock wrong CPU on CPU#2, test/18028
general protection fault: 0000 [#1] SMP
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: spin_dump+0x53/0xc0
Call Trace:
spin_bug+0x30/0x40
do_raw_spin_unlock+0x71/0xa0
_raw_spin_unlock+0xe/0x10
freeary+0x82/0x2a0
? _raw_spin_lock+0xe/0x10
semctl_down.clone.0+0xce/0x160
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
SyS_semctl+0x236/0x2c0
? syscall_trace_leave+0xde/0x130
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0
RSP <ffff88003750fd68>
---[ end trace 783ebb76612867a0 ]---
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
RIP: native_read_tsc+0x0/0x20
Call Trace:
? delay_tsc+0x40/0x70
__delay+0xf/0x20
do_raw_spin_lock+0x96/0x140
_raw_spin_lock+0xe/0x10
sem_lock_and_putref+0x11/0x70
SYSC_semtimedop+0x7bf/0x960
? handle_mm_fault+0xbf6/0x1880
? dequeue_task_fair+0x79/0x4a0
? __do_page_fault+0x19a/0x430
? kfree_debugcheck+0x16/0x40
? __do_page_fault+0x19a/0x430
? __audit_syscall_entry+0xa8/0x100
? do_audit_syscall_entry+0x66/0x70
? syscall_trace_enter_phase1+0x139/0x160
SyS_semtimedop+0xe/0x10
SyS_semop+0x10/0x20
entry_SYSCALL_64_fastpath+0x12/0x71
Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
Kernel panic - not syncing: softlockup: hung tasks

I wasn't able to trigger any badness on a recent kernel without the
proper config debugs enabled, however I have softlockup reports on some
kernel versions, in the semaphore code, which are similar as above (the
scenario is seen on some servers running IBM DB2 which uses semaphore
syscalls).

The patch here fixes the race against freeary, by acquiring or waiting
on the sem_undo_list lock as necessary (exit_sem can race with freeary,
while freeary sets un->semid to -1 and removes the same sem_undo from
list_proc or when it removes the last sem_undo).

After the patch I'm unable to reproduce the problem using the test case
[1].

[1] Test case used below:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

#define NSEM 1
#define NSET 5

int sid[NSET];

void thread()
{
struct sembuf op;
int s;
uid_t pid = getuid();

s = rand() % NSET;
op.sem_num = pid % NSEM;
op.sem_op = 1;
op.sem_flg = SEM_UNDO;

semop(sid[s], &op, 1);
exit(EXIT_SUCCESS);
}

void create_set()
{
int i, j;
pid_t p;
union {
int val;
struct semid_ds *buf;
unsigned short int *array;
struct seminfo *__buf;
} un;

/* Create and initialize semaphore set */
for (i = 0; i < NSET; i++) {
sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
if (sid[i] < 0) {
perror("semget");
exit(EXIT_FAILURE);
}
}
un.val = 0;
for (i = 0; i < NSET; i++) {
for (j = 0; j < NSEM; j++) {
if (semctl(sid[i], j, SETVAL, un) < 0)
perror("semctl");
}
}

/* Launch threads that operate on semaphore set */
for (i = 0; i < NSEM * NSET * NSET; i++) {
p = fork();
if (p < 0)
perror("fork");
if (p == 0)
thread();
}

/* Free semaphore set */
for (i = 0; i < NSET; i++) {
if (semctl(sid[i], NSEM, IPC_RMID))
perror("IPC_RMID");
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}

int main(int argc, char **argv)
{
pid_t p;

srand(time(NULL));

while (1) {
p = fork();
if (p < 0) {
perror("fork");
exit(EXIT_FAILURE);
}
if (p == 0) {
create_set();
goto end;
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}
end:
return 0;
}

[akpm@linux-foundation.org: use normal comment layout]
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
CC: Aristeu Rozanski <aris@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1832f29 06-Aug-2015 Stephen Smalley <sds@tycho.nsa.gov>

ipc: use private shmem or hugetlbfs inodes for shm segments.

The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments. As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.

This was motivated by the following lockdep warning:

======================================================
[ INFO: possible circular locking dependency detected ]
4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
-------------------------------------------------------
httpd/1597 is trying to acquire lock:
(&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&mm->mmap_sem){++++++}:
lock_acquire+0xc7/0x270
__might_fault+0x7a/0xa0
filldir+0x9e/0x130
xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
xfs_readdir+0x1b4/0x330 [xfs]
xfs_file_readdir+0x2b/0x30 [xfs]
iterate_dir+0x97/0x130
SyS_getdents+0x91/0x120
entry_SYSCALL_64_fastpath+0x12/0x76
-> #2 (&xfs_dir_ilock_class){++++.+}:
lock_acquire+0xc7/0x270
down_read_nested+0x57/0xa0
xfs_ilock+0x167/0x350 [xfs]
xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
xfs_attr_get+0xbd/0x190 [xfs]
xfs_xattr_get+0x3d/0x70 [xfs]
generic_getxattr+0x4f/0x70
inode_doinit_with_dentry+0x162/0x670
sb_finish_set_opts+0xd9/0x230
selinux_set_mnt_opts+0x35c/0x660
superblock_doinit+0x77/0xf0
delayed_superblock_init+0x10/0x20
iterate_supers+0xb3/0x110
selinux_complete_init+0x2f/0x40
security_load_policy+0x103/0x600
sel_write_load+0xc1/0x750
__vfs_write+0x37/0x100
vfs_write+0xa9/0x1a0
SyS_write+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
...

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Morten Stevens <mstevens@fedoraproject.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

de54b9ac 06-Aug-2015 Marcus Gelderie <redmnic@gmail.com>

ipc: modify message queue accounting to not take kernel data structures into account

A while back, the message queue implementation in the kernel was
improved to use btrees to speed up retrieval of messages, in commit
d6629859b36d ("ipc/mqueue: improve performance of send/recv").

That patch introducing the improved kernel handling of message queues
(using btrees) has, as a by-product, changed the meaning of the QSIZE
field in the pseudo-file created for the queue. Before, this field
reflected the size of the user-data in the queue. Since, it also takes
kernel data structures into account. For example, if 13 bytes of user
data are in the queue, on my machine the file reports a size of 61
bytes.

There was some discussion on this topic before (for example
https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael
Kerrisk gave the following background
(https://lkml.org/lkml/2015/6/16/74):

The pseudofiles in the mqueue filesystem (usually mounted at
/dev/mqueue) expose fields with metadata describing a message
queue. One of these fields, QSIZE, as originally implemented,
showed the total number of bytes of user data in all messages in
the message queue, and this feature was documented from the
beginning in the mq_overview(7) page. In 3.5, some other (useful)
work happened to break the user-space API in a couple of places,
including the value exposed via QSIZE, which now includes a measure
of kernel overhead bytes for the queue, a figure that renders QSIZE
useless for its original purpose, since there's no way to deduce
the number of overhead bytes consumed by the implementation.
(The other user-space breakage was subsequently fixed.)

This patch removes the accounting of kernel data structures in the
queue. Reporting the size of these data-structures in the QSIZE field
was a breaking change (see Michael's comment above). Without the QSIZE
field reporting the total size of user-data in the queue, there is no
way to deduce this number.

It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
against the worst-case size of the queue (in both the old and the new
implementation). Therefore, the kernel overhead accounting in QSIZE is
not necessary to help the user understand the limitations RLIMIT imposes
on the processes.

Signed-off-by: Marcus Gelderie <redmnic@gmail.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: John Duffy <jb_duffy@btinternet.com>
Cc: Arto Bendiken <arto@bendiken.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6157dbbf 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,sysv: return -EINVAL upon incorrect id/seqnum

In ipc_obtain_object_check we return -EIDRM when a bogus sequence number
is detected via ipc_checkid, while the ipc manpages state the following
return codes for such errors:

EIDRM <ID> points to a removed identifier.
EINVAL Invalid <ID> value, or unaligned, etc.

EIDRM should only be returned upon a RMID call (->deleted check), and thus
return EINVAL for wrong seq. This difference in semantics has also caused
real bugs, ie: https://bugzilla.redhat.com/show_bug.cgi?id=246509

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f8b59184 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,sysv: make return -EIDRM when racing with RMID consistent

The ipc_lock helper is used by all forms of sysv ipc to acquire the ipc
object's spinlock. Upon error (bogus identifier), we always return
-EINVAL, whether the problem be in the idr path or because we raced with a
task performing RMID. For the later, however, all ipc related manpages,
state the that for:

EIDRM <ID> points to a removed identifier.

And return:

EINVAL Invalid <ID> value, or unaligned, etc.

Which (EINVAL) should only return once the ipc resource is deleted. For
all types of ipc this is done immediately upon a RMID command. However,
shared memory behaves slightly different as it can merely mark a segment
for deletion, and delay the actual freeing until there are no more active
consumers. Per shmctl(IPC_RMID) manpage:

""
Mark the segment to be destroyed. The segment will only actually
be destroyed after the last process detaches it (i.e., when the
shm_nattch member of the associated structure shmid_ds is zero).
""

Unlike ipc_lock, paths that behave "correctly", at least per the manpage,
involve controlling the ipc resource via *ctl(), doing the exact same
validity check as ipc_lock after right acquiring the spinlock:

if (!ipc_valid_object()) {
err = -EIDRM;
goto out_unlock;
}

Thus make ipc_lock consistent with the rest of ipc code and return -EIDRM
in ipc_lock when !ipc_valid_object().

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

55b7ae50 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc: rename ipc_obtain_object

... to ipc_obtain_object_idr, which is more meaningful and makes the code
slightly easier to follow.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ff35e5ef 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,msg: provide barrier pairings for lockless receive

We currently use a full barrier on the sender side to to avoid receiver
tasks disappearing on us while still performing on the sender side wakeup.
We lack however, the proper CPU-CPU interactions pairing on the receiver
side which busy-waits for the message. Similarly, we do not need a full
smp_mb, and can relax the semantics for the writer and reader sides of the
message. This is safe as we are only ordering loads and stores to r_msg.
And in both smp_wmb and smp_rmb, there are no stores after the calls
_anyway_.

This obviously applies for pipelined_send and expunge_all, for EIRDM when
destroying a queue.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c5c8975b 30-Jun-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,shm: move BUG_ON check into shm_lock

Upon every shm_lock call, we BUG_ON if an error was returned, indicating
racing either in idr or in shm_destroy. Move this logic into the locking.

[akpm@linux-foundation.org: simplify code]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c859aa83 30-Jun-2015 Pekka Enberg <penberg@kernel.org>

ipc/util.c: use kvfree() in ipc_rcu_free()

Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fa6004ad 04-May-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc/mqueue: Implement lockless pipelined wakeups

This patch moves the wakeup_process() invocation so it is not done under
the info->lock by making use of a lockless wake_q. With this change, the
waiter is woken up once it is STATE_READY and it does not need to loop
on SMP if it is still in STATE_PENDING. In the timeout case we still need
to grab the info->lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -rt
which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
change if the waiter has a higher priority compared to the waker.

Additionally, this patch micro-optimizes wq_sleep by using the cheaper
cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
matter what, thus get rid of the implied barrier.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: George Spelvin <linux@horizon.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

9ec3a646 26-Apr-2015 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only


7f032d6e 15-Apr-2015 Joe Perches <joe@perches.com>

ipc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

75c3cfa8 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: assorted weird filesystems: d_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

52644c9a 17-Feb-2015 Davidlohr Bueso <dave@stgolabs.net>

ipc,sem: use current->state helpers

Call __set_current_state() instead of assigning the new state directly.
These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping
track of who changed the state.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

603ba7e4 16-Dec-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).

The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it allows to (finally) make nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns


07a46ed2 12-Dec-2014 Dave Hansen <dave.hansen@linux.intel.com>

shmdt: use i_size_read() instead of ->i_size

Andrew Morton noted

http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org

that the shmdt uses inode->i_size outside of i_mutex being held.
There is one more case in shm.c in shm_destroy(). This converts
both users over to use i_size_read().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d3c97900 12-Dec-2014 Dave Hansen <dave.hansen@linux.intel.com>

ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments

This is a highly-contrived scenario. But, a single shmdt() call can be
induced in to unmapping memory from mulitple shm segments. Example code
is here:

http://www.sr71.net/~dave/intel/shmfun.c

The fix is pretty simple: Record the 'struct file' for the first VMA we
encounter and then stick to it. Decline to unmap anything not from the
same file and thus the same segment.

I found this by inspection and the odds of anyone hitting this in practice
are pretty darn small.

Lightly tested, but it's a pretty small patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0050ee05 12-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/msg: increase MSGMNI, remove scaling

SysV can be abused to allocate locked kernel memory. For most systems, a
small limit doesn't make sense, see the discussion with regards to SHMMAX.

Therefore: increase MSGMNI to the maximum supported.

And: If we ignore the risk of locking too much memory, then an automatic
scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

The code preserves auto_msgmni to avoid breaking any user space applications
that expect that the value exists.

Notes:
1) If an administrator must limit the memory allocations, then he can set
MSGMNI as necessary.

Or he can disable sysv entirely (as e.g. done by Android).

2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
to control latency vs. throughput:
If MSGMNB is large, then msgsnd() just returns and more messages can be queued
before a task switch to a task that calls msgrcv() is forced.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2e094abf 12-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()

When I fixed bugs in the sem_lock() logic, I was more conservative than
necessary. Therefore it is safe to replace the smp_mb() with smp_rmb().
And: With smp_rmb(), semop() syscalls are up to 10% faster.

The race we must protect against is:

sem->lock is free
sma->complex_count = 0
sma->sem_perm.lock held by thread B

thread A:

A: spin_lock(&sem->lock)

B: sma->complex_count++; (now 1)
B: spin_unlock(&sma->sem_perm.lock);

A: spin_is_locked(&sma->sem_perm.lock);
A: XXXXX memory barrier
A: if (sma->complex_count == 0)

Thread A must read the increased complex_count value, i.e. the read must
not be reordered with the read of sem_perm.lock done by spin_is_locked().

Since it's about ordering of reads, smp_rmb() is sufficient.

[akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

707c5960 10-Dec-2014 Al Viro <viro@zeniv.linux.org.uk>

Merge branch 'nsfs' into for-next


cbfe0de3 10-Dec-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:

- unification of d_splice_alias()/d_materialize_unique()

- iov_iter rewrite

- killing a bunch of ->f_path.dentry users (and f_dentry macro).

Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.

Still not complete, but much closer now.

- crapectomy in lustre (dead code removal, mostly)

- "let's make seq_printf return nothing" preparations

- assorted cleanups and fixes

There _definitely_ will be more piles"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...


33c42940 01-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

copy address of proc_ns_ops into ns_common

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6344c433 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

new helpers: ns_alloc_inum/ns_free_inum

take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

64964528 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

make proc_ns_operations work with struct ns_common * instead of void *

We can do that now. And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3c041184 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

switch the rest of proc_ns_operations to working with &...->ns

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

435d5f4b 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

common object embedded into various struct ....ns

for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e8577d1f 02-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fully initialize sem_array before making it visible

ipc_addid() makes a new ipc identifier visible to everyone. New objects
start as locked, so that the caller can complete the initialization
after the call. Within struct sem_array, at least sma->sem_base and
sma->sem_nsems are accessed without any locks, therefore this approach
doesn't work.

Thus: Move the ipc_addid() to the end of the initialization.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Rik van Riel <riel@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9f45f5bf 31-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

new helper: audit_file()

... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.

[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0d5e7580 13-Oct-2014 Mark Rustad <mark.d.rustad@intel.com>

ipc: resolve shadow warnings

Resolve some shadow warnings produced in W=2 builds by changing the name
of some parameters and local variables. Change instances of "s64"
because that clashes with the well-known typedef. Also change a local
variable with the name "up" because that clashes with the name of of the
"up" function for semaphores. These are hazards so eliminate the
hazards by renaming them.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d66a0520 13-Oct-2014 Rob Jones <rob.jones@codethink.co.uk>

ipc/util.c: use __seq_open_private() instead of seq_open()

Using __seq_open_private() removes boilerplate code from
sysvipc_proc_open().

The resultant code is shorter and easier to follow.

However, please note that __seq_open_private() call kzalloc() rather than
kmalloc() which may affect timing due to the memory initialisation
overhead.

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bf77b94c 13-Oct-2014 Oleg Nesterov <oleg@redhat.com>

ipc/shm: kill the historical/wrong mm->start_stack check

do_shmat() is the only user of ->start_stack (proc just reports its
value), and this check looks ugly and wrong.

The reason for this check is not clear at all, and it wrongly assumes that
the stack can only grow down.

But the main problem is that in general mm->start_stack has nothing to do
with stack_vma->vm_start. Not only the application can switch to another
stack and even unmap this area, setup_arg_pages() expands the stack
without updating mm->start_stack during exec(). This means that in the
likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply
impossible after find_vma_intersection() == F, or the stack can't grow
anyway because of RLIMIT_STACK.

Many thanks to Hugh for his explanations.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1195d94e 13-Oct-2014 Andrey Vagin <avagin@openvz.org>

ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

28596c97 07-Oct-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Pull "trivial tree" updates from Jiri Kosina:
"Usual pile from trivial tree everyone is so eagerly waiting for"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
Remove MN10300_PROC_MN2WS0038
mei: fix comments
treewide: Fix typos in Kconfig
kprobes: update jprobe_example.c for do_fork() change
Documentation: change "&" to "and" in Documentation/applying-patches.txt
Documentation: remove obsolete pcmcia-cs from Changes
Documentation: update links in Changes
Documentation: Docbook: Fix generated DocBook/kernel-api.xml
score: Remove GENERIC_HAS_IOMAP
gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
tty: doc: Fix grammar in serial/tty
dma-debug: modify check_for_stack output
treewide: fix errors in printk
genirq: fix reference in devm_request_threaded_irq comment
treewide: fix synchronize_rcu() in comments
checkstack.pl: port to AArch64
doc: queue-sysfs: minor fixes
init/do_mounts: better syntax description
MIPS: fix comment spelling
powerpc/simpleboot: fix comment
...


da3dae54 08-Sep-2014 Masanari Iida <standby24x7@gmail.com>

Documentation: Docbook: Fix generated DocBook/kernel-api.xml

This patch fix spelling typo found in DocBook/kernel-api.xml.
It is because the file is generated from the source comments,
I have to fix the comments in source codes.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

77e40aae 09-Aug-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull namespace updates from Eric Biederman:
"This is a bunch of small changes built against 3.16-rc6. The most
significant change for users is the first patch which makes setns
drmatically faster by removing unneded rcu handling.

The next chunk of changes are so that "mount -o remount,.." will not
allow the user namespace root to drop flags on a mount set by the
system wide root. Aks this forces read-only mounts to stay read-only,
no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
mounts to stay no exec and it prevents unprivileged users from messing
with a mounts atime settings. I have included my test case as the
last patch in this series so people performing backports can verify
this change works correctly.

The next change fixes a bug in NFS that was discovered while auditing
nsproxy users for the first optimization. Today you can oops the
kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
with pid namespaces. I rebased and fixed the build of the
!CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
that no one to my knowledge bases anything on my tree fixing the typo
in place seems more responsible that requiring a typo-fix to be
backported as well.

The last change is a small semantic cleanup introducing
/proc/thread-self and pointing /proc/mounts and /proc/net at it. This
prevents several kinds of problemantic corner cases. It is a
user-visible change so it has a minute chance of causing regressions
so the change to /proc/mounts and /proc/net are individual one line
commits that can be trivially reverted. Unfortunately I lost and
could not find the email of the original reporter so he is not
credited. From at least one perspective this change to /proc/net is a
refgression fix to allow pthread /proc/net uses that were broken by
the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
proc: Implement /proc/thread-self to point at the directory of the current thread
proc: Have net show up under /proc/<tgid>/task/<tid>
NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
namespaces: Use task_lock and not rcu to protect nsproxy


83293c0f5 08-Aug-2014 Jack Miller <millerjo@us.ibm.com>

shm: allow exit_shm in parallel if only marking orphans

If shm_rmid_force (the default state) is not set then the shmids are only
marked as orphaned and does not require any add, delete, or locking of the
tree structure.

Seperate the sysctl on and off case, and only obtain the read lock. The
newly added list head can be deleted under the read lock because we are
only called with current and will only change the semids allocated by this
task and not manipulate the list.

This commit assumes that up_read includes a sufficient memory barrier for
the writes to be seen my others that later obtain a write lock.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jack Miller <millerjo@us.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ab602f79 08-Aug-2014 Jack Miller <millerjo@us.ibm.com>

shm: make exit_shm work proportional to task activity

This is small set of patches our team has had kicking around for a few
versions internally that fixes tasks getting hung on shm_exit when there
are many threads hammering it at once.

Anton wrote a simple test to cause the issue:

http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

Before applying this patchset, this test code will cause either hanging
tracebacks or pthread out of memory errors.

After this patchset, it will still produce output like:

root@somehost:~# ./bust_shm_exit 1024 160
...
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
INFO: Stall ended before state dump start
...

But the task will continue to run along happily, so we consider this an
improvement over hanging, even if it's a bit noisy.

This patch (of 3):

exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
walks every shared memory segment in the namespace. Thus the amount of
work is related to the number of shm segments in the namespace not the
number of segments that might need to be cleaned.

In addition, this occurs after the task has been notified the thread has
exited, so the number of tasks waiting for the ns shm rwsem can grow
without bound until memory is exausted.

Add a list to the task struct of all shmids allocated by this task. Init
the list head in copy_process. Use the ns->rwsem for locking. Add
segments after id is added, remove before removing from id.

On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
to handling of semaphore undo.

I chose a define for the init sequence since its a simple list init,
otherwise it would require a function call to avoid include loops between
the semaphore code and the task struct. Converting the list_del to
list_del_init for the unshare cases would remove the exit followed by
init, but I left it blow up if not inited.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jack Miller <millerjo@us.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

728dba3a 03-Feb-2014 Eric W. Biederman <ebiederm@xmission.com>

namespaces: Use task_lock and not rcu to protect nsproxy

The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.

Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.

This yields a simpler to understand lock, and a faster setns system call.

In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.

This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

a5c5928b 06-Jun-2014 Joe Perches <joe@perches.com>

ipc: convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9b44ee2e 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)

The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
always (since 0.99.10) reported a thread as sleeping on all semaphores
that are listed in the semop() call.

The documented behavior (both in the Linux man page and in the Single
Unix Specification) is that a task should be reported on exactly one
semaphore: The semaphore that caused the thread to got to sleep.

This patch adds a pr_info_once() that is triggered if a thread hits the
relevant case.

The code triggers slightly too often, otherwise it would be necessary to
replicate the old code. As there are no known users of GETNCNT or
GETZCNT, this is done to prevent unnecessary bloat.

The task that triggered is reported with name (tsk->comm) and pid.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b220c57a 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliant

SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
waits on exactly one semaphore: The semaphore from the first operation
in the sop array that cannot proceed.

The Linux implementation never followed the standard, it tried to count
all semaphores that might be the reason why a task sleeps.

This patch fixes that.

Note:
a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
therefore the code counts them only on demand.
(If they wouldn't be rare, then the non-compliance would have
been found earlier)

b) compared to the initial version of the patch, the BUG_ONs were removed
and it was clarified that the new behavior conforms to SUS.

Back-compatibility concerns:

Manfred:

: - there is no application in Fedora that uses GETNCNT or GETZCNT.
:
: - application that use only single-sop semop() are also safe, the
: difference only affects complex apps.
:
: - portable application are also safe, the new behavior is standard
: compliant.
:
: But that's it. The old behavior existed in Linux from 0.99.something
: until now.

Michael:

: * These operations seem to be very little used. Grepping the public
: source that is contained Fedora 20 source DVD, there appear to be no
: uses. Of course, this says nothing about uses in private /
: non-mainstream FOSS code, but it seems likely that the same pattern
: is followed there.
:
: * The existing behavior is hard enough to understand that I suspect
: that no one understood it well enough to rely on it anyway
: (especially as that behavior contradicted both man page and POSIX).
:
: So, there's a chance of breakage, but I estimate that it's minute.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ed247b7c 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: store which operation blocks in perform_atomic_semop()

Preparation for the next patch:

In the slow-path of perform_atomic_semop(), store a pointer to the
operation that caused the operation to block.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d198cd6d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: change perform_atomic_semop parameters

Right now, perform_atomic_semop gets the content of sem_queue as
individual fields. Changes that, instead pass a pointer to sem_queue.

This is a preparation for the next patch: it uses sem_queue to store the
reason why a task must sleep.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2f2ed41d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove code duplication

count_semzcnt and count_semncnt are more of less identical. The patch
creates a single function that either counts the number of tasks waiting
for zero or waiting due to a decrease operation.

Compared to the initial version, the BUG_ONs were removed.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1994862d 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: bugfix for semctl(,,GETZCNT)

GETZCNT is supposed to return the number of threads that wait until a
semaphore value becomes 0.

The current implementation overlooks complex operations that contain
both wait-for-zero operation and operations that alter at least one
semaphore.

The patch fixes that. It's intentionally copy&paste, this will be
cleaned up in the next patch.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4bb6657d 06-Jun-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,msg: document volatile r_msg

The need for volatile is not obvious, document it.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3440a6bd 06-Jun-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,msg: move some msgq ns code around

Nothing big and no logical changes, just get rid of some redundant
function declarations. Move msg_[init/exit]_ns down the end of the
file.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f75a2f35 06-Jun-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,msg: use current->state helpers

Call __set_current_state() instead of assigning the new state directly.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Manfred Spraul <manfred@colorfullif.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1376327c 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/shm.c: check for integer overflow during shmget.

SHMMAX is the upper limit for the size of a shared memory segment, counted
in bytes. The actual allocation is that size, rounded up to the next full
page.

Add a check that prevents the creation of segments where the rounded up
size causes an integer overflow.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

09c6eb1f 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/shm.c: check for overflows of shm_tot

shm_tot counts the total number of pages used by shm segments.

If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong
values for shm_tot.

The patch adds a detection for overflows.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

247a8ce8 06-Jun-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/shm.c: check for ulong overflows in shmat

The increase of SHMMAX/SHMALL is a 4 patch series.

The change itself is trivial, the only problem are interger overflows.
The overflows are not new, but if we make huge values the default, then
the code should be free from overflows.

SHMMAX:

- shmmem_file_setup places a hard limit on the segment size:
MAX_LFS_FILESIZE.

On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are
possible. Rounded up to full pages the actual allocated size
is 0. --> must be fixed, patch 3

- shmat:
- find_vma_intersection does not handle overflows properly.
--> must be fixed, patch 1

- the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE
and checks for overflows (i.e.: map 2 GB, starting from
addr=2.5GB fails).

SHMALL:
- after creating 8192 segments size (1L<<63)-1, shm_tot overflows and
returns 0. --> must be fixed, patch 2.

Userspace:
- Obviously, there could be overflows in userspace. There is nothing
we can do, only use values smaller than ULONG_MAX.
I ended with "ULONG_MAX - 1L<<24":

- TASK_SIZE cannot be used because it is the size of the current
task. Could be 4G if it's a 32-bit task on a 64-bit kernel.

- The maximum size is not standardized across archs:
I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64.

- Just in case some arch revives a 4G/4G split, nearly
ULONG_MAX is a valid segment size.

- Using "0" as a magic value for infinity is even worse, because
right now 0 means 0, i.e. fail all allocations.

This patch (of 4):

find_vma_intersection() does not work as intended if addr+size overflows.
The patch adds a manual check before the call to find_vma_intersection.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

46c0a8ca 06-Jun-2014 Paul McQuade <paulmcquad@gmail.com>

ipc, kernel: clear whitespace

trailing whitespace

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7153e402 06-Jun-2014 Paul McQuade <paulmcquad@gmail.com>

ipc, kernel: use Linux headers

Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>

Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eb66ec44 06-Jun-2014 Mathias Krause <minipli@googlemail.com>

ipc: constify ipc_ops

There is no need to recreate the very same ipc_ops structure on every
kernel entry for msgget/semget/shmget. Just declare it static and be
done with it. While at it, constify it as we don't modify the structure
at runtime.

Found in the PaX patch, written by the PaX Team.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d08a256 07-Apr-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: use device_initcall

... since __initcall is now deprecated.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

187841a8 07-Apr-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc/compat.c: remove sc_semopm macro

This macro appears to have been introduced back in the 2.5 era for
semtimedop32 backward compatibility on ia32:

https://lkml.org/lkml/2003/4/28/78

Nowadays, this syscall in compat just defaults back to the code found in
sem.c, so it is no longer used and can thus be removed:

long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
unsigned nsops, const struct compat_timespec __user *timeout)
{
struct timespec __user *ts64;
if (compat_convert_timespec(&ts64, timeout))
return -EFAULT;
return sys_semtimedop(semid, tsems, nsops, ts64);
}

Furthermore, there are no users in compat.c. After this change, kernel
builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7125764c 02-Apr-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull compat time conversion changes from Peter Anvin:
"Despite the branch name this is really neither an x86 nor an
x32-specific patchset, although it the implementation of the
discussions that followed the x32 security hole a few months ago.

This removes get/put_compat_timespec/val() and replaces them with
compat_get/put_timespec/val() which are savvy as to the current status
of COMPAT_USE_64BIT_TIME.

It removes several unused and/or incorrect/misleading functions (like
compat_put_timeval_convert which doesn't in fact do any conversion)
and also replaces several open-coded implementations what is now
called compat_convert_timespec() with that function"

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
compat: Fix sparse address space warnings
compat: Get rid of (get|put)_compat_time(val|spec)


190f9186 31-Mar-2014 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 compat wrapper rework from Heiko Carstens:
"S390 compat system call wrapper simplification work.

The intention of this work is to get rid of all hand written assembly
compat system call wrappers on s390, which perform proper sign or zero
extension, or pointer conversion of compat system call parameters.
Instead all of this should be done with C code eg by using Al's
COMPAT_SYSCALL_DEFINEx() macro.

Therefore all common code and s390 specific compat system calls have
been converted to the COMPAT_SYSCALL_DEFINEx() macro.

In order to generate correct code all compat system calls may only
have eg compat_ulong_t parameters, but no unsigned long parameters.
Those patches which change parameter types from unsigned long to
compat_ulong_t parameters are separate in this series, but shouldn't
cause any harm.

The only compat system calls which intentionally have 64 bit
parameters (preadv64 and pwritev64) in support of the x86/32 ABI
haven't been changed, but are now only available if an architecture
defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

System calls which do not have a compat variant but still need proper
zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
get a proper wrapper function with the new s390 specific
COMPAT_SYSCALL_WRAPx() macro:

COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

which generates the following code (simplified):

asmlinkage long sys_brk(unsigned long brk);
asmlinkage long compat_sys_brk(long brk)
{
return sys_brk((u32)brk);
}

Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
includes both linux/syscall.h and linux/compat.h, it will generate
build errors, if the declaration of sys_brk() doesn't match, or if
there exists a non-matching compat_sys_brk() declaration.

In addition this will intentionally result in a link error if
somewhere else a compat_sys_brk() function exists, which probably
should have been used instead. Two more BUILD_BUG_ONs make sure the
size and type of each compat syscall parameter can be handled
correctly with the s390 specific macros.

I converted the compat system calls step by step to verify the
generated code is correct and matches the previous code. In fact it
did not always match, however that was always a bug in the hand
written asm code.

In result we get less code, less bugs, and much more sanity checking"

* 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
s390/compat: add copyright statement
compat: include linux/unistd.h within linux/compat.h
s390/compat: get rid of compat wrapper assembly code
s390/compat: build error for large compat syscall args
mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
ipc/compat: convert to COMPAT_SYSCALL_DEFINE
fs/compat: convert to COMPAT_SYSCALL_DEFINE
security/compat: convert to COMPAT_SYSCALL_DEFINE
mm/compat: convert to COMPAT_SYSCALL_DEFINE
net/compat: convert to COMPAT_SYSCALL_DEFINE
kernel/compat: convert to COMPAT_SYSCALL_DEFINE
fs/compat: optional preadv64/pwrite64 compat system calls
ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
s390/compat: partial parameter conversion within syscall wrappers
s390/compat: automatic zero, sign and pointer conversion of syscalls
s390/compat: add sync_file_range and fallocate compat syscalls
...


4f87dac3 10-Mar-2014 Michael Kerrisk <mtk.manpages@gmail.com>

ipc: Fix 2 bugs in msgrcv() MSG_COPY implementation

While testing and documenting the msgrcv() MSG_COPY flag that Stanislav
Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue
copy feature" => kernel 3.8), I discovered a couple of bugs in the
implementation. The two bugs concern MSG_COPY interactions with other
msgrcv() flags, namely:

(A) MSG_COPY + MSG_EXCEPT
(B) MSG_COPY + !IPC_NOWAIT

The bugs are distinct (and the fix for the first one is obvious),
however my fix for both is a single-line patch, which is why I'm
combining them in a single mail, rather than writing two mails+patches.

===== (A) MSG_COPY + MSG_EXCEPT =====

With the addition of the MSG_COPY flag, there are now two msgrcv()
flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp'
argument in unrelated ways. Specifying both in the same call is a
logical error that is currently permitted, with the effect that MSG_COPY
has priority and MSG_EXCEPT is ignored. The call should give an error
if both flags are specified. The patch below implements that behavior.

===== (B) (B) MSG_COPY + !IPC_NOWAIT =====

The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC
message queue copy feature test") shows MSG_COPY being used in
conjunction with IPC_NOWAIT. In other words, if there is no message at
the position 'msgtyp'. return immediately with the error in ENOMSG.

What was not (fully) tested is the behavior if MSG_COPY is specified
*without* IPC_NOWAIT, and there is an odd behavior. If the queue
contains less than 'msgtyp' messages, then the call blocks until the
next message is written to the queue. At that point, the msgrcv() call
returns a copy of the newly added message, regardless of whether that
message is at the ordinal position 'msgtyp'. This is clearly bogus, and
problematic for applications that might want to make use of the MSG_COPY
flag.

I considered the following possible solutions to this problem:

(1) Force the call to block until a message *does* appear at the
position 'msgtyp'.

(2) If the MSG_COPY flag is specified, the kernel should implicitly add
IPC_NOWAIT, so that the call fails with ENOMSG for this case.

(3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate
an error (probably, EINVAL is the right one).

I do not know if any application would really want to have the
functionality of solution (1), especially since an application can
determine in advance the number of messages in the queue using msgctl()
IPC_STAT. Obviously, this solution would be the most work to implement.

Solution (2) would have the effect of silently fixing any applications
that tried to employ broken behavior. However, it would mean that if we
later decided to implement solution (1), then user-space could not
easily detect what the kernel supports (but, since I'm somewhat doubtful
that solution (1) is needed, I'm not sure that this is much of a
problem).

Solution (3) would have the effect of informing broken applications that
they are doing something broken. The downside is that this would cause
a ABI breakage for any applications that are currently employing the
broken behavior. However:

a) Those applications are almost certainly not getting the results they
expect.
b) Possibly, those applications don't even exist, because MSG_COPY is
currently hidden behind CONFIG_CHECKPOINT_RESTORE.

The upside of solution (3) is that if we later decided to implement
solution (1), user-space could determine what the kernel supports, via
the error return.

In my view, solution (3) is mildly preferable to solution (2), and
solution (1) could still be done later if anyone really cares. The
patch below implements solution (3).

PS. For anyone out there still listening, it's the usual story:
documenting an API (and the thinking about, and the testing of the API,
that documentation entails) is the one of the single best ways of
finding bugs in the API, as I've learned from a lot of experience. Best
to do that documentation before releasing the API.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: stable@vger.kernel.org
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8eee9093 04-Mar-2014 Heiko Carstens <hca@linux.ibm.com>

ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types

In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that
performs proper zero and sign extension convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

5d70a596 04-Mar-2014 Heiko Carstens <hca@linux.ibm.com>

ipc/compat: convert to COMPAT_SYSCALL_DEFINE

Convert all compat system call functions where all parameter types
have a size of four or less than four bytes, or are pointer types
to COMPAT_SYSCALL_DEFINE.
The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
zero and sign extension to 64 bit of all parameters if needed.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

291fdb0b 03-Mar-2014 Heiko Carstens <hca@linux.ibm.com>

ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t

Change the type of compat_sys_msgrcv's msgtyp parameter from long
to compat_long_t, since compat user space passes only a 32 bit signed
value.
Let the compat wrapper do proper sign extension to 64 bit of this
parameter.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

f3713fd9 25-Feb-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,mqueue: remove limits for the amount of system-wide queues

Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and
locations") added global hardcoded limits to the amount of message
queues that can be created. While these limits are per-namespace,
reality is that it ends up breaking userspace applications.
Historically users have, at least in theory, been able to create up to
INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
for some workloads and use cases. For instance, Madars reports:

"This update imposes bad limits on our multi-process application. As
our app uses approaches that each process opens its own set of queues
(usually something about 3-5 queues per process). In some scenarios
we might run up to 3000 processes or more (which of-course for linux
is not a problem). Thus we might need up to 9000 queues or more. All
processes run under one user."

Other affected users can be found in launchpad bug #1155695:
https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

Instead of increasing this limit, revert it entirely and fallback to the
original way of dealing queue limits -- where once a user's resource
limit is reached, and all memory is used, new queues cannot be created.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Madars Vitolins <m@silodev.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

81993e81 01-Feb-2014 H. Peter Anvin <hpa@linux.intel.com>

compat: Get rid of (get|put)_compat_time(val|spec)

We have two APIs for compatiblity timespec/val, with confusingly
similar names. compat_(get|put)_time(val|spec) *do* handle the case
where COMPAT_USE_64BIT_TIME is set, whereas
(get|put)_compat_time(val|spec) do not. This is an accident waiting
to happen.

Clean it up by favoring the full-service version; the limited version
is replaced with double-underscore versions static to kernel/compat.c.

A common pattern is to convert a struct timespec to kernel format in
an allocation on the user stack. Unfortunately it is open-coded in
several places. Since this allocation isn't actually needed if
COMPAT_USE_64BIT_TIME is true (since user format == kernel format)
encapsulate that whole pattern into the function
compat_convert_timespec(). An equivalent function should be written
for struct timeval if it is needed in the future.

Finally, get rid of compat_(get|put)_timeval_convert(): each was only
used once, and the latter was not even doing what the function said
(no conversion actually was being done.) Moving the conversion into
compat_sys_settimeofday() itself makes the code much more similar to
sys_settimeofday() itself.

v3: Remove unused compat_convert_timeval().

v2: Drop bogus "const" in the destination argument for
compat_convert_time*().

Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

e7ca2552 27-Jan-2014 Mateusz Guzik <mguzik@redhat.com>

ipc: fix compat msgrcv with negative msgtyp

Compat function takes msgtyp argument as u32 and passes it down to
do_msgrcv which results in casting to long, thus the sign is lost and we
get a big positive number instead.

Cast the argument to signed type before passing it down.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reported-by: Gabriellla Schmidt <gsc@bruker.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ffa571da 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc,msg: document barriers

Both expunge_all() and pipeline_send() rely on both a nil msg value and
a full barrier to guarantee the correct ordering when waking up a task.

While its counterpart at the receiving end is well documented for the
lockless recv algorithm, we still need to document these specific
smp_mb() calls.

[akpm@linux-foundation.org: fix typo, per Mike]
[akpm@linux-foundation.org: mroe tpyos]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

daf948c7 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: delete seq_max field in struct ipc_ids

This field is only used to reset the ids seq number if it exceeds the
smaller of INT_MAX/SEQ_MULTIPLIER and USHRT_MAX, and can therefore be
moved out of the structure and into its own macro. Since each
ipc_namespace contains a table of 3 pointers to struct ipc_ids we can
save space in instruction text:

text data bss dec hex filename
56232 2348 24 58604 e4ec ipc/built-in.o
56216 2348 24 58588 e4dc ipc/built-in.o-after

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Jonathan Gonzalez <jgonzalez@linets.cl>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8dc5cd04 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: simplify sysvipc_proc_open() return

Get rid of silly/useless label jumping.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

95d4eb28 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: remove useless return statement

Only found in ipc_rmid().

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3ab08fe2 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: remove braces for single statements

Deal with checkpatch messages:
WARNING: braces {} are not necessary for single statement blocks

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8001c858 27-Jan-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: standardize code comments

IPC commenting style is all over the place, *specially* in util.c. This
patch orders things a bit.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

239521f3 27-Jan-2014 Manfred Spraul <manfred@colorfullife.com>

ipc: whitespace cleanup

The ipc code does not adhere the typical linux coding style.
This patch fixes lots of simple whitespace errors.

- mostly autogenerated by
scripts/checkpatch.pl -f --fix \
--types=pointer_location,spacing,space_before_tab
- one manual fixup (keep structure members tab-aligned)
- removal of additional space_before_tab that were not found by --fix

Tested with some of my msg and sem test apps.

Andrew: Could you include it in -mm and move it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Li Bin <huawei.libin@huawei.com>
Cc: Joe Perches <joe@perches.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

72a8ff2f 27-Jan-2014 Rafael Aquini <aquini@redhat.com>

ipc: change kern_ipc_perm.deleted type to bool

struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and
the changes introduced by this patch are just to make the case explicit.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0f3d2b01 27-Jan-2014 Rafael Aquini <aquini@redhat.com>

ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races

After the locking semantics for the SysV IPC API got improved, a couple
of IPC_RMID race windows were opened because we ended up dropping the
'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The
spotted races got sorted out by re-introducing the old test within the
racy critical sections.

This patch introduces ipc_valid_object() to consolidate the way we cope
with IPC_RMID races by using the same abstraction across the API
implementation.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

78f5009c 27-Jan-2014 Petr Mladek <pmladek@suse.cz>

ipc/sem.c: avoid overflow of semop undo (semadj) value

When trying to understand semop code, I found a small mistake in the check
for semadj (undo) value overflow. The new undo value is not stored
immediately and next potential checks are done against the old value.

The failing scenario is not much practical. One semop call has to do more
operations on the same semaphore. Also semval and semadj must have
different values, so there has to be some operations without SEM_UNDO
flag. For example:

struct sembuf depositor_op[1];
struct sembuf collector_op[2];

depositor_op[0].sem_num = 0;
depositor_op[0].sem_op = 20000;
depositor_op[0].sem_flg = 0;

collector_op[0].sem_num = 0;
collector_op[0].sem_op = -10000;
collector_op[0].sem_flg = SEM_UNDO;
collector_op[1].sem_num = 0;
collector_op[1].sem_op = -10000;
collector_op[1].sem_flg = SEM_UNDO;

if (semop(semid, depositor_op, 1) == -1)
{ perror("Failed to do 1st deposit"); return 1; }

if (semop(semid, collector_op, 2) == -1)
{ perror("Failed to do 1st collect"); return 1; }

if (semop(semid, depositor_op, 1) == -1)
{ perror("Failed to do 2nd deposit"); return 1; }

if (semop(semid, collector_op, 2) == -1)
{ perror("Failed to do 2nd collect"); return 1; }

return 0;

It passes without error now but the semadj value has overflown in the 2nd
collector operation.

[akpm@linux-foundation.org: restore lessened scope of local `undo']
[davidlohr@hp.com: correct header comment for perform_atomic_semop]
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3a72660b 21-Nov-2013 Jesper Nilsson <jesper.nilsson@axis.com>

ipc,shm: correct error return value in shmctl (SHM_UNLOCK)

Commit 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
restructured the ipc shm to shorten critical region, but introduced a
path where the return value could be -EPERM, even if the operation
actually was performed.

Before the commit, the err return value was reset by the return value
from security_shm_shmctl() after the if (!ns_capable(...)) statement.

Now, we still exit the if statement with err set to -EPERM, and in the
case of SHM_UNLOCK, it is not reset at all, and used as the return value
from shmctl.

To fix this, we only set err when errors occur, leaving the fallthrough
case alone.

Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a399b29d 21-Nov-2013 Greg Thelen <gthelen@google.com>

ipc,shm: fix shm_file deletion races

When IPC_RMID races with other shm operations there's potential for
use-after-free of the shm object's associated file (shm_file).

Here's the race before this patch:

TASK 1 TASK 2
------ ------
shm_rmid()
ipc_lock_object()
shmctl()
shp = shm_obtain_object_check()

shm_destroy()
shum_unlock()
fput(shp->shm_file)
ipc_lock_object()
shmem_lock(shp->shm_file)
<OOPS>

The oops is caused because shm_destroy() calls fput() after dropping the
ipc_lock. fput() clears the file's f_inode, f_path.dentry, and
f_path.mnt, which causes various NULL pointer references in task 2. I
reliably see the oops in task 2 if with shmlock, shmu

This patch fixes the races by:
1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock().
2) modify at risk operations to check shm_file while holding
ipc_object_lock().

Example workloads, which each trigger oops...

Workload 1:
while true; do
id=$(shmget 1 4096)
shm_rmid $id &
shmlock $id &
wait
done

The oops stack shows accessing NULL f_inode due to racing fput:
_raw_spin_lock
shmem_lock
SyS_shmctl

Workload 2:
while true; do
id=$(shmget 1 4096)
shmat $id 4096 &
shm_rmid $id &
wait
done

The oops stack is similar to workload 1 due to NULL f_inode:
touch_atime
shmem_mmap
shm_mmap
mmap_region
do_mmap_pgoff
do_shmat
SyS_shmat

Workload 3:
while true; do
id=$(shmget 1 4096)
shmlock $id
shm_rmid $id &
shmunlock $id &
wait
done

The oops stack shows second fput tripping on an NULL f_inode. The
first fput() completed via from shm_destroy(), but a racing thread did
a get_file() and queued this fput():
locks_remove_flock
__fput
____fput
task_work_run
do_notify_resume
int_signal

Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat")
Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> # 3.10.17+ 3.11.6+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5cbb3d21 12-Nov-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew Morton)

Merge first patch-bomb from Andrew Morton:
"Quite a lot of other stuff is banked up awaiting further
next->mainline merging, but this batch contains:

- Lots of random misc patches
- OCFS2
- Most of MM
- backlight updates
- lib/ updates
- printk updates
- checkpatch updates
- epoll tweaking
- rtc updates
- hfs
- hfsplus
- documentation
- procfs
- update gcov to gcc-4.7 format
- IPC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
ipc, msg: fix message length check for negative values
ipc/util.c: remove unnecessary work pending test
devpts: plug the memory leak in kill_sb
./Makefile: export initial ramdisk compression config option
init/Kconfig: add option to disable kernel compression
drivers: w1: make w1_slave::flags long to avoid memory corruption
drivers/w1/masters/ds1wm.cuse dev_get_platdata()
drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
drivers/memstick/core/mspro_block.c: fix attributes array allocation
drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
kernel/panic.c: reduce 1 byte usage for print tainted buffer
gcov: reuse kbasename helper
kernel/gcov/fs.c: use pr_warn()
kernel/module.c: use pr_foo()
gcov: compile specific gcov implementation based on gcc version
gcov: add support for gcc 4.7 gcov format
gcov: move gcov structs definitions to a gcc version specific file
kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
...


9bc9ccd7 12-Nov-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:

- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes

plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...


4e9b45a1 12-Nov-2013 Mathias Krause <minipli@googlemail.com>

ipc, msg: fix message length check for negative values

On 64 bit systems the test for negative message sizes is bogus as the
size, which may be positive when evaluated as a long, will get truncated
to an int when passed to load_msg(). So a long might very well contain a
positive value but when truncated to an int it would become negative.

That in combination with a small negative value of msg_ctlmax (which will
be promoted to an unsigned type for the comparison against msgsz, making
it a big positive value and therefore make it pass the check) will lead to
two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
small buffer as the addition of alen is effectively a subtraction. 2/ The
copy_from_user() call in load_msg() will first overflow the buffer with
userland data and then, when the userland access generates an access
violation, the fixup handler copy_user_handle_tail() will try to fill the
remainder with zeros -- roughly 4GB. That almost instantly results in a
system crash or reset.

,-[ Reproducer (needs to be run as root) ]--
| #include <sys/stat.h>
| #include <sys/msg.h>
| #include <unistd.h>
| #include <fcntl.h>
|
| int main(void) {
| long msg = 1;
| int fd;
|
| fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
| write(fd, "-1", 2);
| close(fd);
|
| msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
|
| return 0;
| }
'---

Fix the issue by preventing msgsz from getting truncated by consistently
using size_t for the message length. This way the size checks in
do_msgsnd() could still be passed with a negative value for msg_ctlmax but
we would fail on the buffer allocation in that case and error out.

Also change the type of m_ts from int to size_t to avoid similar nastiness
in other code paths -- it is used in similar constructs, i.e. signed vs.
unsigned checks. It should never become negative under normal
circumstances, though.

Setting msg_ctlmax to a negative value is an odd configuration and should
be prevented. As that might break existing userland, it will be handled
in a separate commit so it could easily be reverted and reworked without
reintroducing the above described bug.

Hardening mechanisms for user copy operations would have catched that bug
early -- e.g. checking slab object sizes on user copy operations as the
usercopy feature of the PaX patch does. Or, for that matter, detect the
long vs. int sign change due to truncation, as the size overflow plugin
of the very same patch does.

[akpm@linux-foundation.org: fix i386 min() warnings]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> [ v2.3.27+ -- yes, that old ;) ]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

206fa940 12-Nov-2013 Xie XiuQi <xiexiuqi@huawei.com>

ipc/util.c: remove unnecessary work pending test

Remove unnecessary work pending test before calling schedule_work(). It
has been tested in queue_work_on() already. No functional changed.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b21996e3 20-Sep-2011 J. Bruce Fields <bfields@redhat.com>

locks: break delegations on unlink

We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.

Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:

acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retry

It is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.

The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.

Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9bf76ca3 02-Nov-2013 Mathias Krause <minipli@googlemail.com>

ipc, msg: forbid negative values for "msg{max,mnb,mni}"

Negative message lengths make no sense -- so don't do negative queue
lenghts or identifier counts. Prevent them from getting negative.

Also change the underlying data types to be unsigned to avoid hairy
surprises with sign extensions in cases where those variables get
evaluated in unsigned expressions with bigger data types, e.g size_t.

In case a user still wants to have "unlimited" sizes she could just use
INT_MAX instead.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6e224f94 16-Oct-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: synchronize semop and semctl with IPC_RMID

After acquiring the semlock spinlock, operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but
should be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
access memory after free.

The patch also:
- standardizes the tests for .deleted, so that all tests in one
function leave the function with the same approach.
- unconditionally tests for .deleted immediately after every call to
sem_lock - even it it means that for semctl(GETALL), .deleted will be
tested twice.

Both changes make the review simpler: After every sem_lock, there must
be a test of .deleted, followed by a goto to the cleanup code (if the
function uses "goto cleanup").

The only exception is semctl_down(): If sem_ids().rwsem is locked, then
the presence in ids->ipcs_idr is equivalent to !.deleted, thus no
additional test is required.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

18ccee26 16-Oct-2013 Davidlohr Bueso <davidlohr@hp.com>

ipc: update locking scheme comments

The initial documentation was a bit incomplete, update accordingly.

[akpm@linux-foundation.org: make it more readable in 80 columns]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4271b05a 30-Sep-2013 Davidlohr Bueso <davidlohr@hp.com>

ipc,msg: prevent race with rmid in msgsnd,msgrcv

This fixes a race in both msgrcv() and msgsnd() between finding the msg
and actually dealing with the queue, as another thread can delete shmid
underneath us if we are preempted before acquiring the
kern_ipc_perm.lock.

Manfred illustrates this nicely:

Assume a preemptible kernel that is preempted just after

msq = msq_obtain_object_check(ns, msqid)

in do_msgrcv(). The only lock that is held is rcu_read_lock().

Now the other thread processes IPC_RMID. When the first task is
resumed, then it will happily wait for messages on a deleted queue.

Fix this by checking for if the queue has been deleted after taking the
lock.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: <stable@vger.kernel.org> [3.11]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0e8c6656 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update sem_otime for all operations

In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
spinlock section"), the update of semaphore's sem_otime(last semop time)
was moved to one central position (do_smart_update).

But since do_smart_update() is only called for operations that modify
the array, this means that wait-for-zero semops do not update sem_otime
anymore.

The fix is simple:
Non-alter operations must update sem_otime.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Jia He <jiakernel@gmail.com>
Tested-by: Jia He <jiakernel@gmail.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d8c63376 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: synchronize the proc interface

The proc interface is not aware of sem_lock(), it instead calls
ipc_lock_object() directly. This means that simple semop() operations
can run in parallel with the proc interface. Right now, this is
uncritical, because the implementation doesn't do anything that requires
a proper synchronization.

But it is dangerous and therefore should be fixed.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d07b68c 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize sem_lock()

Operations that need access to the whole array must guarantee that there
are no simple operations ongoing. Right now this is achieved by
spin_unlock_wait(sem->lock) on all semaphores.

If complex_count is nonzero, then this spin_unlock_wait() is not
necessary, because it was already performed in the past by the thread
that increased complex_count and even though sem_perm.lock was dropped
inbetween, no simple operation could have started, because simple
operations cannot start when complex_count is non-zero.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5e9d5275 30-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix race in sem_lock()

The exclusion of complex operations in sem_lock() is insufficient: after
acquiring the per-semaphore lock, a simple op must first check that
sem_perm.lock is not locked and only after that test check
complex_count. The current code does it the other way around - and that
creates a race. Details are below.

The patch is a complete rewrite of sem_lock(), based in part on the code
from Mike Galbraith. It removes all gotos and all loops and thus the
risk of livelocks.

I have tested the patch (together with the next one) on my i3 laptop and
it didn't cause any problems.

The bug is probably also present in 3.10 and 3.11, but for these kernels
it might be simpler just to move the test of sma->complex_count after
the spin_is_locked() test.

Details of the bug:

Assume:
- sma->complex_count = 0.
- Thread 1: semtimedop(complex op that must sleep)
- Thread 2: semtimedop(simple op).

Pseudo-Trace:

Thread 1: sem_lock(): acquire sem_perm.lock
Thread 1: sem_lock(): check for ongoing simple ops
Nothing ongoing, thread 2 is still before sem_lock().
Thread 1: try_atomic_semop()
<<< preempted.

Thread 2: sem_lock():
static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
int nsops)
{
int locknum;
again:
if (nsops == 1 && !sma->complex_count) {
struct sem *sem = sma->sem_base + sops->sem_num;

/* Lock just the semaphore we are interested in. */
spin_lock(&sem->lock);

/*
* If sma->complex_count was set while we were spinning,
* we may need to look at things we did not lock here.
*/
if (unlikely(sma->complex_count)) {
spin_unlock(&sem->lock);
goto lock_array;
}
<<<<<<<<<
<<< complex_count is still 0.
<<<
<<< Here it is preempted
<<<<<<<<<

Thread 1: try_atomic_semop() returns, notices that it must sleep.
Thread 1: increases sma->complex_count.
Thread 1: drops sem_perm.lock
Thread 2:
/*
* Another process is holding the global lock on the
* sem_array; we cannot enter our critical section,
* but have to wait for the global lock to be released.
*/
if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
spin_unlock(&sem->lock);
spin_unlock_wait(&sma->sem_perm.lock);
goto again;
}
<<< sem_perm.lock already dropped, thus no "goto again;"

locknum = sops->sem_num;

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

53dad6d3 23-Sep-2013 Davidlohr Bueso <davidlohr@hp.com>

ipc: fix race with LSMs

Currently, IPC mechanisms do security and auditing related checks under
RCU. However, since security modules can free the security structure,
for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
race if the structure is freed before other tasks are done with it,
creating a use-after-free condition. Manfred illustrates this nicely,
for instance with shared mem and selinux:

-> do_shmat calls rcu_read_lock()
-> do_shmat calls shm_object_check().
Checks that the object is still valid - but doesn't acquire any locks.
Then it returns.
-> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
-> selinux_shm_shmat calls ipc_has_perm()
-> ipc_has_perm accesses ipc_perms->security

shm_close()
-> shm_close acquires rw_mutex & shm_lock
-> shm_close calls shm_destroy
-> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
-> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
-> ipc_free_security calls kfree(ipc_perms->security)

This patch delays the freeing of the security structures after all RCU
readers are done. Furthermore it aligns the security life cycle with
that of the rest of IPC - freeing them based on the reference counter.
For situations where we need not free security, the current behavior is
kept. Linus states:

"... the old behavior was suspect for another reason too: having the
security blob go away from under a user sounds like it could cause
various other problems anyway, so I think the old code was at least
_prone_ to bugs even if it didn't have catastrophic behavior."

I have tested this patch with IPC testcases from LTP on both my
quad-core laptop and on a 64 core NUMA server. In both cases selinux is
enabled, and tests pass for both voluntary and forced preemption models.
While the mentioned races are theoretical (at least no one as reported
them), I wanted to make sure that this new logic doesn't break anything
we weren't aware of.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

20b8875a 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: drop ipc_lock_check

No remaining users, we now use ipc_obtain_object_check().

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7a25dd9e 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc, shm: drop shm_lock_check

This function was replaced by a the lockless shm_obtain_object_check(),
and no longer has any users.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32a27500 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: drop ipc_lock_by_ptr

After previous cleanups and optimizations, this function is no longer
heavily used and we don't have a good reason to keep it. Update the few
remaining callers and get rid of it.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

530fcd16 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc, shm: guard against non-existant vma in shmdt(2)

When !CONFIG_MMU there's a chance we can derefence a NULL pointer when the
VM area isn't found - check the return value of find_vma().

Also, remove the redundant -EINVAL return: retval is set to the proper
return code and *only* changed to 0, when we actually unmap the segments.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

05603c44 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: document general ipc locking scheme

As suggested by Andrew, add a generic initial locking scheme used
throughout all sysv ipc mechanisms. Documenting the ids rwsem, how rcu
can be enough to do the initial checks and when to actually acquire the
kern_ipc_perm.lock spinlock.

I found that adding it to util.c was generic enough.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4718787d 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: drop msg_unlock

There is only one user left, drop this function and just call
ipc_unlock_object() and rcu_read_unlock().

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d9a605e4 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: rename ids->rw_mutex

Since in some situations the lock can be shared for readers, we shouldn't
be calling it a mutex, rename it to rwsem.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c2c737a0 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: shorten critical region for shmat

Similar to other system calls, acquire the kern_ipc_perm lock after doing
the initial permission and security checks.

[sasha.levin@oracle.com: dont leave do_shmat with rcu lock held]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f42569b1 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: cleanup do_shmat pasta

Clean up some of the messy do_shmat() spaghetti code, getting rid of
out_free and out_put_dentry labels. This makes shortening the critical
region of this function in the next patch a little easier to do and read.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2caacaa8 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: shorten critical region for shmctl

With the *_INFO, *_STAT, IPC_RMID and IPC_SET commands already optimized,
deal with the remaining SHM_LOCK and SHM_UNLOCK commands. Take the
shm_perm lock after doing the initial auditing and security checks. The
rest of the logic remains unchanged.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c97cb9cc 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: make shmctl_nolock lockless

While the INFO cmd doesn't take the ipc lock, the STAT commands do acquire
it unnecessarily. We can do the permissions and security checks only
holding the rcu lock.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

68eccc1d 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: introduce shmctl_nolock

Similar to semctl and msgctl, when calling msgctl, the *_INFO and *_STAT
commands can be performed without acquiring the ipc object.

Add a shmctl_nolock() function and move the logic of *_INFO and *_STAT out
of msgctl(). Since we are just moving functionality, this change still
takes the lock and it will be properly lockless in the next patch.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3b1c4ad3 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: drop ipcctl_pre_down

Now that sem, msgque and shm, through *_down(), all use the lockless
variant of ipcctl_pre_down(), go ahead and delete it.

[akpm@linux-foundation.org: fix function name in kerneldoc, cleanups]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

79ccf0f8 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: shorten critical region in shmctl_down

Instead of holding the ipc lock for the entire function, use the
ipcctl_pre_down_nolock and only acquire the lock for specific commands:
RMID and SET.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8b8d52ac 11-Sep-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,shm: introduce lockless functions to obtain the ipc object

This is the third and final patchset that deals with reducing the amount
of contention we impose on the ipc lock (kern_ipc_perm.lock). These
changes mostly deal with shared memory, previous work has already been
done for semaphores and message queues:

http://lkml.org/lkml/2013/3/20/546 (sems)
http://lkml.org/lkml/2013/5/15/584 (mqueues)

With these patches applied, a custom shm microbenchmark stressing shmctl
doing IPC_STAT with 4 threads a million times, reduces the execution
time by 50%. A similar run, this time with IPC_SET, reduces the
execution time from 3 mins and 35 secs to 27 seconds.

Patches 1-8: replaces blindly taking the ipc lock for a smarter
combination of rcu and ipc_obtain_object, only acquiring the spinlock
when updating.

Patch 9: renames the ids rw_mutex to rwsem, which is what it already was.

Patch 10: is a trivial mqueue leftover cleanup

Patch 11: adds a brief lock scheme description, requested by Andrew.

This patch:

Add shm_obtain_object() and shm_obtain_object_check(), which will allow us
to get the ipc object without acquiring the lock. Just as with other
forms of ipc, these functions are basically wrappers around
ipc_obtain_object*().

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c7c4591d 07-Sep-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.

The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.

Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users


bebcb928 03-Sep-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/msg.c: Fix lost wakeup in msgsnd().

The check if the queue is full and adding current to the wait queue of
pending msgsnd() operations (ss_add()) must be atomic.

Otherwise:
- the thread that performs msgsnd() finds a full queue and decides to
sleep.
- the thread that performs msgrcv() first reads all messages from the
queue and then sleeps, because the queue is empty.
- the msgrcv() calls do not perform any wakeups, because the msgsnd()
task has not yet called ss_add().
- then the msgsnd()-thread first calls ss_add() and then sleeps.

Net result: msgsnd() and msgrcv() both sleep forever.

Observed with msgctl08 from ltp with a preemptible kernel.

Fix: Call ipc_lock_object() before performing the check.

The patch also moves security_msg_queue_msgsnd() under ipc_lock_object:
- msgctl(IPC_SET) explicitely mentions that it tries to expunge any
pending operations that are not allowed anymore with the new
permissions. If security_msg_queue_msgsnd() is called without locks,
then there might be races.
- it makes the patch much simpler.

Reported-and-tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: stable@vger.kernel.org # for 3.11
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c7b96acf 20-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

userns: Kill nsown_capable it makes the wrong thing easy

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

368ae537 28-Aug-2013 Svenning Sørensen <sss@secomea.dk>

IPC: bugfix for msgrcv with msgtyp < 0

According to 'man msgrcv': "If msgtyp is less than 0, the first message of
the lowest type that is less than or equal to the absolute value of msgtyp
shall be received."

Bug: The kernel only returns a message if its type is 1; other messages
with type < abs(msgtype) will never get returned.

Fix: After having traversed the list to find the first message with the
lowest type, we need to actually return that message.

This regression was introduced by commit daaf74cf0867 ("ipc: refactor
msg list search into separate function")

Signed-off-by: Svenning Soerensen <sss@secomea.dk>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

758a6ba3 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: rename try_atomic_semop() to perform_atomic_semop(), docu update

Cleanup: Some minor points that I noticed while writing the previous
patches

1) The name try_atomic_semop() is misleading: The function performs the
operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d12e1e50 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: replace shared sem_otime with per-semaphore value

sem_otime contains the time of the last semaphore operation that
completed successfully. Every operation updates this value, thus access
from multiple cpus can cause thrashing.

Therefore the patch replaces the variable with a per-semaphore variable.
The per-array sem_otime is only calculated when required.

No performance improvement on a single-socket i3 - only important for
larger systems.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f269f40a 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: always use only one queue for alter operations

There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
priorization of the wakeups: complex operations have priority over
simple ops.

The patch restores the behavior of linux <=3.0.9: The longest waiting
operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.

As a side effect, do_smart_update_queue() becomes much simpler: no more
goto logic.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1a82e9e1 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem: separate wait-for-zero and alter tasks into seperate queues

Introduce separate queues for operations that do not modify the
semaphore values. Advantages:

- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations are
always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f5c936c0 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: cacheline align the semaphore structures

As now each semaphore has its own spinlock and parallel operations are
possible, give each semaphore its own cacheline.

On a i3 laptop, this gives up to 28% better performance:

#semscale 10 | grep "interleave 2"
- before:
Cpus 1, interleave 2 delay 0: 36109234 in 10 secs
Cpus 2, interleave 2 delay 0: 55276317 in 10 secs
Cpus 3, interleave 2 delay 0: 62411025 in 10 secs
Cpus 4, interleave 2 delay 0: 81963928 in 10 secs

-after:
Cpus 1, interleave 2 delay 0: 35527306 in 10 secs
Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28%
Cpus 3, interleave 2 delay 0: 80518538 in 10 secs
Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7%

i3, with 2 cores and with hyperthreading enabled. Interleave 2 in order
use first the full cores. HT partially hides the delay from cacheline
trashing, thus the improvement is "only" 8.7% if 4 threads are running.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

196aa013 08-Jul-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/util.c, ipc_rcu_alloc: cacheline align allocation

Enforce that ipc_rcu_alloc returns a cacheline aligned pointer on SMP.

Rationale:

The SysV sem code tries to move the main spinlock into a seperate
cacheline (____cacheline_aligned_in_smp). This works only if
ipc_rcu_alloc returns cacheline aligned pointers. vmalloc and kmalloc
return cacheline algined pointers, the implementation of ipc_rcu_alloc
breaks that.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9ad66ae6 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: remove unused functions

We can now drop the msg_lock and msg_lock_check functions along with a
bogus comment introduced previously in semctl_down.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

41a0d523 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: shorten critical region in msgrcv

do_msgrcv() is the last msg queue function that abuses the ipc lock Take
it only when needed when actually updating msq.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3dd1f784 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: shorten critical region in msgsnd

do_msgsnd() is another function that does too many things with the ipc
object lock acquired. Take it only when needed when actually updating
msq.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ac0ba20e 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: make msgctl_nolock lockless

While the INFO cmd doesn't take the ipc lock, the STAT commands do
acquire it unnecessarily. We can do the permissions and security checks
only holding the rcu lock.

This function now mimics semctl_nolock().

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a5001a0d 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: introduce lockless functions to obtain the ipc object

Add msq_obtain_object() and msq_obtain_object_check(), which will allow
us to get the ipc object without acquiring the lock. Just as with
semaphores, these functions are basically wrappers around
ipc_obtain_object*().

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2cafed30 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: introduce msgctl_nolock

Similar to semctl, when calling msgctl, the *_INFO and *_STAT commands
can be performed without acquiring the ipc object.

Add a msgctl_nolock() function and move the logic of *_INFO and *_STAT
out of msgctl(). This change still takes the lock and it will be
properly lockless in the next patch

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

15724ecb 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,msg: shorten critical region in msgctl_down

Instead of holding the ipc lock for the entire function, use the
ipcctl_pre_down_nolock and only acquire the lock for specific commands:
RMID and SET.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7b4cc5d8 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: move locking out of ipcctl_pre_down_nolock

This function currently acquires both the rw_mutex and the rcu lock on
successful lookups, leaving the callers to explicitly unlock them,
creating another two level locking situation.

Make the callers (including those that still use ipcctl_pre_down())
explicitly lock and unlock the rwsem and rcu lock.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cf9d5d78 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: close open coded spin lock calls

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1ca7003a 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: introduce ipc object locking helpers

Simple helpers around the (kern_ipc_perm *)->lock spinlock.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbfcd91f 08-Jul-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: move rcu lock out of ipc_addid

This patchset continues the work that began in the sysv ipc semaphore
scaling series, see

https://lkml.org/lkml/2013/3/20/546

Just like semaphores used to be, sysv shared memory and msg queues also
abuse the ipc lock, unnecessarily holding it for operations such as
permission and security checks.

This patchset mostly deals with mqueues, and while shared mem can be
done in a very similar way, I want to get these patches out in the open
first. It also does some pending cleanups, mostly focused on the two
level locking we have in ipc code, taking care of ipc_addid() and
ipcctl_pre_down_nolock() - yes there are still functions that need to be
updated as well.

This patch:

Make all callers explicitly take and release the RCU read lock.

This addresses the two level locking seen in newary(), newseg() and
newqueue(). For the last two, explicitly unlock the ipc object and the
rcu lock, instead of calling the custom shm_unlock and msg_unlock
functions. The next patch will deal with the open coded locking for
->perm.lock

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c103a4dc 08-Jul-2013 Andrew Morton <akpm@linux-foundation.org>

ipc/shmc.c: eliminate ugly 80-col tricks

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

79f6530c 08-Jul-2013 Jeff Layton <jlayton@kernel.org>

audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record

The old audit PATH records for mq_open looked like this:

type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023

...with the audit related changes that went into 3.7, they now look like this:

type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
obj=system_u:object_r:tmpfs_t:s15:c0.c1023
type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"

Both of these look wrong to me. As Steve Grubb pointed out:

"What we need is 1 PATH record that identifies the MQ. The other PATH
records probably should not be there."

Fix it to record the mq root as a parent, and flag it such that it
should be hidden from view when the names are logged, since the root of
the mq filesystem isn't terribly interesting. With this change, we get
a single PATH record that looks more like this:

type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmpfs_t:s0

In order to do this, a new audit_inode_parent_hidden() function is
added. If we do it this way, then we avoid having the existing callers
of audit_inode needing to do any sort of flag conversion if auditing is
inactive.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ab465df9 26-May-2013 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: Fix missing wakeups in do_smart_update_queue()

do_smart_update_queue() is called when an operation (semop,
semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check
which of the sleeping tasks can proceed.

do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue must be
scanned again

And:
- the test for "|sops == NULL) before scanning the global queue is not
required: If the global queue is empty, then it doesn't need to be
scanned - regardless of the reason for calling do_smart_update_queue()

The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

091d0d55 09-May-2013 Li Zefan <lizefan@huawei.com>

shm: fix null pointer deref when userspace specifies invalid hugepage size

Dave reported an oops triggered by trinity:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: newseg+0x10d/0x390
PGD cf8c1067 PUD cf8c2067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
...
Call Trace:
ipcget+0x182/0x380
SyS_shmget+0x5a/0x60
tracesys+0xdd/0xe2

This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
failure in unaligned size request").

Reported-by: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Li Zefan <lizfan@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

de2657f9 09-May-2013 Rik van Riel <riel@redhat.com>

ipc,sem: fix semctl(..., GETNCNT)

The semctl GETNCNT returns the number of semops waiting for the
specified semaphore to become nonzero. After commit 9f1bc2c9022c
("ipc,sem: have only one list in struct sem_queue"), the semops waiting
on just one semaphore are waiting on that semaphore's list.

In order to return the correct count, we have to walk that list too, in
addition to the sem_array's list for complex operations.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ebc2e5e6 09-May-2013 Rik van Riel <riel@redhat.com>

ipc,sem: fix semctl(..., GETZCNT)

The semctl GETZCNT returns the number of semops waiting for the
specified semaphore to become zero. After commit 9f1bc2c9022c
("ipc,sem: have only one list in struct sem_queue"), the semops waiting
on just one semaphore are waiting on that semaphore's list.

In order to return the correct count, we have to walk that list too, in
addition to the sem_array's list for complex operations.

This bug broke dbench; it works again with this patch applied.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Kent Overstreet <koverstreet@google.com>
Tested-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

af73e4d9 07-May-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

hugetlbfs: fix mmap failure in unaligned size request

The current kernel returns -EINVAL unless a given mmap length is
"almost" hugepage aligned. This is because in sys_mmap_pgoff() the
given length is passed to vm_mmap_pgoff() as it is without being aligned
with hugepage boundary.

This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
alignment of huge page requests"), where alignment code is pushed into
hugetlb_file_setup() and the variable len in caller side is not changed.

To fix this, this patch partially reverts that commit, and adds
alignment code in caller side. And it also introduces hstate_sizelog()
in order to get proper hstate to specified hugepage size.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

[akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: <iceman_dvd@yahoo.com>
Cc: Steven Truelove <steven.truelove@utoronto.ca>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

941b0304 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: simplify rcu_read_lock() in semctl_nolock()

This trivially combines two rcu_read_lock() calls in both sides of a
if-statement into one single one in front of the if-statement.

Split out as an independent cleanup from the previous commit.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c728b9c8 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: simplify semtimedop/semctl_main() common error path handling

With various straight RCU lock/unlock movements, one common exit path
pattern had become

rcu_read_unlock();
goto out_wakeup;

and in fact there were no cases where we wanted to exit to out_wakeup
_without_ releasing the RCU read lock.

So replace that pattern with "goto out_rcu_wakeup", and remove the old
out_wakeup.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

321310ce 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move sem_obtain_lock() rcu locking into the only caller

sem_obtain_lock() was another of those functions that returned with the
RCU lock held for reading in the success case. Move the RCU locking to
the caller (semtimedop()), making it more obvious. We already did RCU
locking elsewhere in that function.

Side note: why does semtimedop() re-do the semphore lookup after the
sleep, rather than just getting a reference to the semaphore it already
looked up originally?

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fbfd1d28 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: fix double sem unlock in semctl error path

Fix another ipc locking buglet introduced by the scalability patches:
when semctl_down() was changed to delay the semaphore locking, one error
path for security_sem_semctl() went through the semaphore unlock logic
even though the semaphore had never been locked.

Introduced by commit 16df3674efe3 ("ipc,sem: do not hold ipc lock more
than necessary")

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4091fd94 04-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move the rcu_read_lock() from sem_lock_and_putref() into callers

This is another ipc semaphore locking cleanup, trying to make the
locking more straightforward. We move the rcu read locking into the
callers of sem_lock_and_putref(), which in general means that we now
mostly do the rcu_read_lock() and rcu_read_unlock() in the same
function.

Mostly. We still have the ipc_addid/newary/freeary mess, and things
like ipcctl_pre_down_nolock().

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

73b29505 03-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: sem_putref() does not need the semaphore lock any more

ipc_rcu_putref() uses atomics for the refcount, and the games to lock
and unlock the semaphore just to try to keep the reference counting
working are no longer useful.

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d49dab8 03-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

ipc: move rcu_read_unlock() out of sem_unlock() and into callers

The IPC locking is a mess, and sem_unlock() unlocks not only the
semaphore spinlock, it also drops the rcu read lock. Unlike sem_lock(),
which just gets the spin-lock, and expects the caller to get the rcu
read lock.

This all makes things very hard to follow, and it's very confusing when
you take the rcu read lock in one function, and then release it in
another. And it has caused actual bugs: the sem_obtain_lock() function
ended up dropping the RCU read lock twice in one error path, because it
first did the sem_unlock(), and then did a rcu_read_unlock() to match
the rcu_read_lock() it had done.

This is just a totally mindless "remove rcu_read_unlock() from
sem_unlock() and add it immediately after each caller" (except for the
aforementioned bug where we did too many rcu_read_unlock(), and in
find_alloc_undo() where we just got the rcu_read_lock() to correct for
the fact that sem_unlock would immediately drop it again).

We can (and should) clean things up further, but this fixes the bug with
the minimal amount of subtlety.

Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ce857229 02-May-2013 Al Viro <viro@ZenIV.linux.org.uk>

ipc: fix GETALL/IPC_RM race for sysv semaphores

We can step on WARN_ON_ONCE() in sem_getref() if a semaphore is removed
just as we are about to call sem_getref() from semctl_main(); results
are not pretty.

We should fail with -EIDRM, same as if IPC_RM happened while we'd been
doing allocation there. This also expands sem_getref() at its only
callsite (and fixed there), while sem_getref_and_unlock() is simply
killed off - it has no callers at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

600fe975 28-Apr-2013 Al Viro <viro@zeniv.linux.org.uk>

ipc_schedule_free() can do vfree() directly now

Commit 32fcfd40715e ("make vfree() safe to call from interrupt
contexts") made it safe to do vfree directly from the RCU callback,
which allows us to simplify ipc/util.c a lot by getting rid of the
differences between vmalloc/kmalloc memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

20b4fb48 01-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...


0bb80f24 11-Apr-2013 David Howells <dhowells@redhat.com>

proc: Split the namespace stuff out into linux/proc_ns.h

Split the proc namespace stuff out into linux/proc_ns.h.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d69f3bad 30-Apr-2013 Robin Holt <holt@sgi.com>

ipc: sysv shared memory limited to 8TiB

Trying to run an application which was trying to put data into half of
memory using shmget(), we found that having a shmall value below 8EiB-8TiB
would prevent us from using anything more than 8TiB. By setting
kernel.shmall greater than 8EiB-8TiB would make the job work.

In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.

ipc/shm.c:
458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
459 {
...
465 int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
...
474 if (ns->shm_tot + numpages > ns->shm_ctlall)
475 return -ENOSPC;

[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
Signed-off-by: Robin Holt <holt@sgi.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

41239fe8 30-Apr-2013 Nikola Pajkovsky <npajkovs@redhat.com>

ipc/msg.c: use list_for_each_entry_[safe] for list traversing

The ipc/msg.c code does its list operations by hand and it open-codes the
accesses, instead of using for_each_entry_[safe].

Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6062a8dc 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: fine grained locking for semtimedop

Introduce finer grained locking for semtimedop, to handle the common case
of a program wanting to manipulate one semaphore from an array with
multiple semaphores.

If the call is a semop manipulating just one semaphore in an array with
multiple semaphores, only take the lock for that semaphore itself.

If the call needs to manipulate multiple semaphores, or another caller is
in a transaction that manipulates multiple semaphores, the sem_array lock
is taken, as well as all the locks for the individual semaphores.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159

[davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
[davidlohr.bueso@hp.com: make refcounter atomic]
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Emmanuel Benisty <benisty.e@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9f1bc2c9 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: have only one list in struct sem_queue

Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows us to
introduce finer grained locking for semtimedop.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c460b662 30-Apr-2013 Rik van Riel <riel@surriel.com>

ipc,sem: open code and rename sem_lock

Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
later that only locks the sem_array and does nothing else.

Open code the locking from ipc_lock() in sem_obtain_lock() so we can
introduce finer grained locking for the sem_array in the next patch.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

16df3674 30-Apr-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc,sem: do not hold ipc lock more than necessary

Instead of holding the ipc lock for permissions and security checks, among
others, only acquire it when necessary.

Some numbers....

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+ 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.86% a.out [kernel.kallsyms] [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+ 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check

2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%

[riel@redhat.com: fix two locking bugs]
[sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

444d0f62 30-Apr-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: introduce lockless pre_down ipcctl

Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
check permissions, mostly for IPC_RMID and IPC_SET commands.

Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is retained, yet modified to call the nolock version
without affecting its semantics, thus transparent to all ipc callers.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4d2bff5e 30-Apr-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: introduce obtaining a lockless ipc object

Through ipc_lock() and therefore ipc_lock_check() we currently return the
locked ipc object. This is not necessary for all situations and can,
therefore, cause unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only lookup and return the ipc object.

Both these functions must be called within the RCU read critical section.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7bb4deff 30-Apr-2013 Davidlohr Bueso <davidlohr.bueso@hp.com>

ipc: remove bogus lock comment for ipc_checkid

This series makes the sysv semaphore code more scalable, by reducing the
time the semaphore lock is held, and making the locking more scalable for
semaphore arrays with multiple semaphores.

The first four patches were written by Davidlohr Buesso, and reduce the
hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking to be more
fine grained, providing a performance boost when multiple semaphores in a
semaphore array are being manipulated simultaneously.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159

This patch:

There is no reason to be holding the ipc lock while reading ipcp->seq,
hence remove misleading comment.

Also simplify the return value for the function.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1e3c941c 30-Apr-2013 HoSung Jung <rain6557@gmail.com>

ipc/msgutil.c: use linux/uaccess.h

Signed-off-by: HoSung Jung <rain6557@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

daaf74cf 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: refactor msg list search into separate function

[fengguang.wu@intel.com: find_msg can be static]
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d076ac91 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: simplify msg list search

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8ac6ed58 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: implement MSG_COPY as a new receive mode

Teach the helper routines about MSG_COPY so that msgtyp is preserved as
the message number to copy.

The security functions affected by this change were audited and no
additional changes are necessary.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

852028af 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: remove msg handling from queue scan

In preparation for refactoring the queue scan into a separate
function, relocate msg copying.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2b3097a2 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: set EFAULT as default error in load_msg()

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

da085d45 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: tighten msg copy loops

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

be5f4b33 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: separate msg allocation from userspace copy

Separating msg allocation enables single-block vmalloc
allocation instead.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3d8fa456 30-Apr-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: clamp with min()

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

08d76760 01-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions


8f68fa2d 29-Apr-2013 Andrew Morton <akpm@linux-foundation.org>

ipc/util.c: use register_hotmemory_notifier()

Squishes a statement-with-no-effect warning, removes some ifdefs and
shrinks .text by one byte!

Note that this code fails to check for blocking_notifier_chain_register()
failures.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d9dda78b 31-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

procfs: new helper - PDE_DATA(inode)

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2dc958fa 01-Apr-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: set msg back to -EAGAIN if copy wasn't performed

Make sure that msg pointer is set back to error value in case of
MSG_COPY flag is set and desired message to copy wasn't found. This
garantees that msg is either a error pointer or a copy address.

Otherwise the last message in queue will be freed without unlinking from
the queue (which leads to memory corruption) and the dummy allocated
copy won't be released.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2c3de1c2 28-Mar-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.

The rest of the changes are just plain bug fixes.

Many thanks to Andy Lutomirski for pointing out many of these issues."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.


a636b702 21-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

ipc: Restrict mounting the mqueue filesystem

Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
rights over the ipc namespace. The principle here is if you create
or have capabilities over it you can mount it, otherwise you get to live
with what other people have mounted.

This information is not particularly sensitive and mqueue essentially
only reports which posix messages queues exist. Still when creating a
restricted environment for an application to live any extra
information may be of use to someone with sufficient creativity. The
historical if imperfect way this information has been restricted has
been not to allow mounts and restricting this to ipc namespace
creators maintains the spirit of the historical restriction.

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

38d78e58 22-Mar-2013 Vladimir Davydov <vdavydov.dev@gmail.com>

mqueue: sys_mq_open: do not call mnt_drop_write() if read-only

mnt_drop_write() must be called only if mnt_want_write() succeeded,
otherwise the mnt_writers counter will diverge.

mnt_writers counters are used to check if remounting FS as read-only is
OK, so after an extra mnt_drop_write() call, it would be impossible to
remount mqueue FS as read-only. Besides, on umount a warning would be
printed like this one:

=====================================
[ BUG: bad unlock balance detected! ]
3.9.0-rc3 #5 Not tainted
-------------------------------------
a.out/12486 is trying to release lock (sb_writers) at:
mnt_drop_write+0x1f/0x30
but there are no more locks to release!

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

88b9e456 08-Mar-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: don't allocate a copy larger than max

When MSG_COPY is set, a duplicate message must be allocated for the copy
before locking the queue. However, the copy could not be larger than was
sent which is limited to msg_ctlmax.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1082f45 08-Mar-2013 Peter Hurley <peter@hurleysoftware.com>

ipc: fix potential oops when src msg > 4k w/ MSG_COPY

If the src msg is > 4k, then dest->next points to the
next allocated segment; resetting it just prior to dereferencing
is bad.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1fd1f49 05-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of union semop in sys_semctl(2) arguments

just have the bugger take unsigned long and deal with SETVAL
case (when we use an int member in the union) explicitly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0e65a81b 03-Feb-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

56e41d3c 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

merge compat sys_ipc instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

22d1a35d 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

make HAVE_SYSCALL_WRAPPERS unconditional

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

54924ea3 27-Feb-2013 Tejun Heo <tj@kernel.org>

ipc: convert to idr_alloc()

Convert to the much saner new idr interface.

The new interface doesn't directly translate to the way idr_pre_get()
was used around ipc_addid() as preloading disables preemption. From
my cursory reading, it seems like we should be able to do all
allocation from ipc_addid(), so I moved it there. Can you please
check whether this would be okay? If this is wrong and ipc_addid()
should be allowed to be called from non-sleepable context, I'd suggest
allocating id itself in the outer functions and later install the
pointer using idr_replace().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d895cb1a 26-Feb-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.

The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.

Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.

PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...


94f2f142 25-Feb-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.

I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.

There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.

The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.

XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...


41badc15 22-Feb-2013 Michel Lespinasse <michel@lespinasse.org>

mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introduced the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bebeb3d6 22-Feb-2013 Michel Lespinasse <michel@lespinasse.org>

mm: introduce mm_populate() for populating new vmas

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

39b65252 12-Sep-2012 Anatol Pomozov <anatol.pomozov@gmail.com>

fs: Preserve error code in get_empty_filp(), part 2

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bc1b69ed 27-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com>

userns: Allow the unprivileged users to mount mqueue fs

This patch allow the unprivileged user to mount mqueuefs in
user ns.

If two userns share the same ipcns,the files in mqueue fs
should be seen in both these two userns.

If the userns has its own ipcns,it has its own mqueue fs too.
ipcns has already done this job well.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

3fcfe786 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: add more comments to message copying related code

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

51eeacaa 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: simplify message copying

Remove the redundant and confusing fill_copy(). Also add copy_msg()
check for error. In this case exit from the function have to be done
instead of break, because further code interprets any error as EAGAIN.

Also define copy_msg() for the case when CONFIG_CHECKPOINT_RESTORE is
disabled.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b30efe27 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: convert prepare_copy() from macro to function

This code works if CONFIG_CHECKPOINT_RESTORE is disabled.

[akpm@linux-foundation.org: remove __maybe_unused]
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

85398aa8 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: simplify free_copy() call

Passing and checking of msgflg to free_copy() is redundant. This patch
sets copy to NULL on declaration instead and checks for non-NULL in
free_copy().

Note: in case of copy allocation failure, error is returned immediately.
So no need to check for IS_ERR() in free_copy().

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3a665531 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

selftests: IPC message queue copy feature test

This test can be used to check wheither kernel supports IPC message queue
copy and restore features (required by CRIU project).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4a674f34 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: introduce message queue copy feature

This patch is required for checkpoint/restore in userspace.

c/r requires some way to get all pending IPC messages without deleting
them from the queue (checkpoint can fail and in this case tasks will be
resumed, so queue have to be valid).

To achive this, new operation flag MSG_COPY for sys_msgrcv() system call
was introduced. If this flag was specified, then mtype is interpreted as
number of the message to copy.

If MSG_COPY is set, then kernel will allocate dummy message with passed
size, and then use new copy_msg() helper function to copy desired message
(instead of unlinking it from the queue).

Notes:

1) Return -ENOSYS if MSG_COPY is specified, but
CONFIG_CHECKPOINT_RESTORE is not set.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9dd87f4 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: message queue receive cleanup

Move all message related manipulation into one function msg_fill().
Actually, two functions because of the compat one.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

03f59566 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: add sysctl to specify desired next object id

Add 3 new variables and sysctls to tune them (by one "next_id" variable
for messages, semaphores and shared memory respectively). This variable
can be used to set desired id for next allocated IPC object. By default
it's equal to -1 and old behaviour is preserved. If this variable is
non-negative, then desired idr will be extracted from it and used as a
start value to search for free IDR slot.

Notes:

1) this patch doesn't guarantee that the new object will have desired
id. So it's up to user space how to handle new object with wrong id.

2) After a sucessful id allocation attempt, "next_id" will be set back
to -1 (if it was non-negative).

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9afdacda 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: remove forced assignment of selected message

This is a cleanup patch. The assignment is redundant.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a2faf2fc 18-Dec-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull (again) user namespace infrastructure changes from Eric Biederman:
"Those bugs, those darn embarrasing bugs just want don't want to get
fixed.

Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.

When you get a chance can you please repull my branch"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Fix typo in description of the limitation of userns_install
userns: Add a more complete capability subset test to commit_creds
userns: Require CAP_SYS_ADMIN for most uses of setns.
Fix cap_capable to only allow owners in the parent user namespace to have caps.


6a2b60b1 17-Dec-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.

The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.

The files under /proc/<pid>/ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.

Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...


5e4a0847 14-Dec-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Require CAP_SYS_ADMIN for most uses of setns.

Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

42d7395f 11-Dec-2012 Andi Kleen <ak@linux.intel.com>

mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB

There was some desire in large applications using MAP_HUGETLB or
SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
others. This is useful together with NUMA policy: use 2MB interleaving
on some mappings, but 1GB on local mappings.

This patch extends the IPC/SHM syscall interfaces slightly to allow
specifying the page size.

It borrows some upper bits in the existing flag arguments and allows
encoding the log of the desired page size in addition to the *_HUGETLB
flag. When 0 is specified the default size is used, this makes the
change fully compatible.

Extending the internal hugetlb code to handle this is straight forward.
Instead of a single mount it just keeps an array of them and selects the
right mount based on the specified page size. When no page size is
specified it uses the mount of the default page size.

The change is not visible in /proc/mounts because internal mounts don't
appear there. It also has very little overhead: the additional mounts
just consume a super block, but not more memory when not used.

I also exported the new flags to the user headers (they were previously
under __KERNEL__). Right now only symbols for x86 and some other
architecture for 1GB and 2MB are defined. The interface should already
work for all other architectures though. Only architectures that define
multiple hugetlb sizes actually need it (that is currently x86, tile,
powerpc). However tile and powerpc have user configurable hugetlb
sizes, so it's not easy to add defines. A program on those
architectures would need to query sysfs and use the appropiate log2.

[akpm@linux-foundation.org: cleanups]
[rientjes@google.com: fix build]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

98f842e6 15-Jun-2011 Eric W. Biederman <ebiederm@xmission.com>

proc: Usable inode numbers for the namespace file descriptors.

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

bcf58e72 26-Jul-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Make create_new_namespaces take a user_ns parameter

Modify create_new_namespaces to explicitly take a user namespace
parameter, instead of implicitly through the task_struct.

This allows an implementation of unshare(CLONE_NEWUSER) where
the new user namespace is not stored onto the current task_struct
until after all of the namespaces are created.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

142e1d1d 26-Jul-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Allow unprivileged use of setns.

- Push the permission check from the core setns syscall into
the setns install methods where the user namespace of the
target namespace can be determined, and used in a ns_capable
call.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

adb5c247 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

audit: make audit_inode take struct filename

Keep a pointer to the audit_names "slot" in struct filename.

Have all of the audit_inode callers pass a struct filename ponter to
audit_inode instead of a string pointer. If the aname field is already
populated, then we can skip walking the list altogether and just use it
directly.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

91a27b2a 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

vfs: define struct filename and have getname() return it

getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.

For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.

This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.

Later, we'll add other information to the struct as it becomes
convenient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bfcec708 10-Oct-2012 Jeff Layton <jlayton@kernel.org>

audit: set the name_len in audit_inode for parent lookups

Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.

Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.

While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1638113d 08-Oct-2012 Michel Lespinasse <michel@lespinasse.org>

ipc/mqueue: remove unnecessary rb_init_node() calls

Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
rbtree of message priorities, and usage of rb_init_node() to initialize
the corresponding nodes. As it turns out, rb_init_node() is unnecessary
here, as the nodes are fully initialized on insertion by rb_link_node()
and the code doesn't access nodes that aren't inserted on the rbtree.

Removing the rb_init_node() calls as I removed that function during
rbtree API cleanups (the only other use of it was in a place that
similarly didn't require it).

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aab174f0 02-Oct-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...


437589a7 02-Oct-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.

The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.

The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.

Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.

Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.

While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.

Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.

After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...


2903ff01 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch simple cases of fget_light to fdget

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

515e0d66 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch mqueue syscalls to fget_light()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1efdb69b 07-Feb-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Convert ipc to use kuid and kgid where appropriate

- Store the ipc owner and creator with a kuid
- Store the ipc group and the crators group with a kgid.
- Add error handling to ipc_update_perms, allowing it to
fail if the uids and gids can not be converted to kuids
or kgids.
- Modify the proc files to display the ipc creator and
owner in the user namespace of the opener of the proc file.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

312b90fb 06-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

mqueue: lift mnt_want_write() outside ->i_mutex, clean up a bit

the way it abuses ->d_fsdata still needs to be killed, but that's
a separate story.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c1d7e01d 30-Jul-2012 Will Deacon <will@kernel.org>

ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION

Rather than #define the options manually in the architecture code, add
Kconfig options for them and select them there instead. This also allows
us to select the compat IPC version parsing automatically for platforms
using the old compat IPC interface.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

05ba3f1a 30-Jul-2012 Will Deacon <will@kernel.org>

ipc: compat: use signed size_t types for msgsnd and msgrcv

The msgsnd and msgrcv system calls use size_t to represent the size of the
message being transferred. POSIX states that values of msgsz greater than
SSIZE_MAX cause the result to be implementation-defined. On Linux, this
equates to returning -EINVAL if (long) msgsz < 0.

For compat tasks where !CONFIG_ARCH_WANT_OLD_COMPAT_IPC and compat_size_t
is smaller than size_t, negative size values passed from userspace will be
interpreted as positive values by do_msg{rcv,snd} and will fail to exit
early with -EINVAL.

This patch changes the compat prototypes for msg{rcv,snd} so that the
message size is represented as a compat_ssize_t, which we cast to the
native ssize_t type for the core IPC code.

Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b610c04c 30-Jul-2012 Will Deacon <will@kernel.org>

ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC

Commit 48b25c43e6ee ("ipc: provide generic compat versions of IPC
syscalls") added a new ARCH_WANT_OLD_COMPAT_IPC config option for
architectures to select if their compat target requires the old IPC
syscall interface.

For architectures (such as AArch64) that do not require the internal
calling conventions provided by this option, but have a compat target
where the C library passes the IPC_64 flag explicitly,
compat_ipc_parse_version no longer strips out the flag before calling
the native system call implementation, resulting in unknown SHM/IPC
commands and -EINVAL being returned to userspace.

This patch separates the selection of the internal calling conventions
for the IPC syscalls from the version parsing, allowing architectures to
select __ARCH_WANT_COMPAT_IPC_PARSE_VERSION if they want to use version
parsing whilst retaining the newer syscall calling conventions.

Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

079a96ae 30-Jul-2012 Will Deacon <will@kernel.org>

ipc: add COMPAT_SHMLBA support

If the SHMLBA definition for a native task differs from the definition for
a compat task, the do_shmat() function would need to handle both.

This patch introduces COMPAT_SHMLBA, which is used by the compat shmat
syscall when calling the ipc code and allows architectures such as AArch64
(where the native SHMLBA is 64k but the compat (AArch32) definition is
16k) to provide the correct semantics for compat IPC system calls.

Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

765927b2 26-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

switch dentry_open() to struct path, make it grab references itself

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

312b63fb 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata * to vfs_create()

all we want is a boolean flag, same as the method gets now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ebfc3b49 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata to ->create()

boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7d8a4569 07-Jun-2012 Will Deacon <will@kernel.org>

ipc: shm: restore MADV_REMOVE functionality on shared memory segments

Commit 17cf28afea2a ("mm/fs: remove truncate_range") removed the
truncate_range inode operation in favour of the fallocate file
operation.

When using SYSV IPC shared memory segments, calling madvise with the
MADV_REMOVE advice on an area of shared memory will attempt to invoke
the .fallocate function for the shm_file_operations, which is NULL and
therefore returns -EOPNOTSUPP to userspace. The previous behaviour
would inherit the inode_operations from the underlying tmpfs file and
invoke truncate_range there.

This patch restores the previous behaviour by wrapping the underlying
fallocate function in shm_fallocate, as we do for fsync.

[hughd@google.com: use -ENOTSUPP in shm_fallocate()]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1193755a 01-Jun-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.

Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...


e3fc629d 30-May-2012 Al Viro <viro@zeniv.linux.org.uk>

switch aio and shm to do_mmap_pgoff(), make do_mmap() static

after all, 0 bytes and 0 pages is the same thing...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8b3ec681 30-May-2012 Al Viro <viro@zeniv.linux.org.uk>

take security_mmap_file() outside of ->mmap_sem

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ce2d52cc 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: add rbtree node caching support

When I wrote the first patch that added the rbtree support for message
queue insertion, it sped up the case where the queue was very full
drastically from the original code. It, however, slowed down the case
where the queue was empty (not drastically though).

This patch caches the last freed rbtree node struct so we can quickly
reuse it when we get a new message. This is the common path for any queue
that very frequently goes from 0 to 1 then back to 0 messages in queue.

Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
msg_insert, so this patch attempts to speculatively allocate a new node
struct outside of the spin lock when we know we need it, but will still
fall back to a GFP_ATOMIC allocation if it has to.

Once I added the caching, the necessary various ret = ; spin_unlock
gyrations in mq_timedsend were getting pretty ugly, so this also slightly
refactors that function to streamline the flow of the code and the
function exit.

Finally, while working on getting performance back I made sure that all of
the node structs were always fully initialized when they were first used,
rendering the use of kzalloc unnecessary and a waste of CPU cycles.

The net result of all of this is:

1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
on it when necessary.

2) We will speculatively allocate a node struct using GFP_KERNEL if our
cache is empty (and save the struct to our cache if it's still empty
after we have obtained the spin lock).

3) The performance of the common queue empty case has significantly
improved and is now much more in line with the older performance for
this case.

The performance changes are:

Old mqueue new mqueue new mqueue + caching
queue empty
send/recv 305/288ns 349/318ns 310/322ns

I don't think we'll ever be able to get the recv performance back, but
that's because the old recv performance was a direct result and
consequence of the old methods abysmal send performance. The recv path
simply must do more so that the send path does not incur such a penalty
under higher queue depths.

As it turns out, the new caching code also sped up the various queue full
cases relative to my last patch. That could be because of the difference
between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
change in code flow in the mq_timedsend routine. Regardless, I'll take
it. It wasn't huge, and I *would* say it was within the margin for error,
but after many repeated runs what I'm seeing is that the old numbers trend
slightly higher (about 10 to 20ns depending on which test is the one
running).

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

113289cc 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: strengthen checks on mqueue creation

We already check the mq attr struct if it's passed in, but now that the
admin can set system wide defaults separate from maximums, it's actually
possible to set the defaults to something that would overflow. So, if
there is no attr struct passed in to the open call, check the default
values.

While we are at it, simplify mq_attr_ok() by making it return 0 or an
error condition, so that way if we add more tests to it later, we have the
option of what error should be returned instead of the calling location
having to pick a possibly inaccurate error code.

[akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2c12ea49 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: correct mq_attr_ok test

While working on the other parts of the mqueue stuff, I noticed that the
calculation for overflow in mq_attr_ok didn't actually match reality (this
is especially true since my last patch which changed how we account memory
slightly).

In particular, we used to test for overflow using:
msgs * msgsize + msgs * sizeof(struct msg_msg *)

That was never really correct because each message we allocate via
load_msg() is actually a struct msg_msg followed by the data for the
message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
allocating struct msg_msgseg structs too, but accounting for them would
get really tedious, so let's ignore those...they're only a pointer in size
anyway). This patch updates the calculation to be more accurate in
regards to maximum possible memory consumption by the mqueue.

[akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d6629859 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: improve performance of send/recv

The existing implementation of the POSIX message queue send and recv
functions is, well, abysmal. Even worse than abysmal. I submitted a
patch to increase the maximum POSIX message queue limit to 65536 due to
customer needs, however, upon looking over the send/recv implementation, I
realized that my customer needs help with that too even if they don't know
it. The basic problem is that, given the fairly typical use case scenario
for a large queue of queueing lots of messages all at the same priority (I
verified with my customer that this is indeed what their app does), the
msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
that's *so* middle school.

OK, OK, to not slam the original author too much, I'm sure they didn't
envision a queue depth of 50,000+ messages. No one would think that
moving elements in an array, one at a time, and dereferencing each pointer
in that array to check priority of the message being pointed too, again
one at a time, for 50,000+ times would be good. So let's assume that, as
is typical, the users have found a way to break our code simply by using
it in a way we didn't envision. Fair enough.

"So, just how broken is it?", you ask. I wondered the same thing, so I
wrote an app to let me know. It's my next patch. It gave me some
interesting results. Here's what it tested:

Interference with other apps - In continuous mode, the app just sits there
and hits a message queue forever, while you go do something productive on
another terminal using other CPUs. You then measure how long it takes you
to do that something productive. Then you restart the app in fake
continuous mode, and it sits in a tight loop on a CPU while you repeat
your tests. The whole point of this is to keep one CPU tied up (so it
can't be used in your other work) but in one case tied up hitting the
mqueue code so we can see the effect of walking that 65,528 element array
one pointer at a time on the global CPU cache. If it's bad, then it will
slow down your app on the other CPUs just by polluting cache mercilessly.
In the fake case, it will be in a tight loop, but not polluting cache.
Testing the mqueue subsystem directly - Here we just run a number of tests
to see how the mqueue subsystem performs under different conditions. A
couple conditions are known to be worst case for the old system, and some
routines, so this tests all of them.

So, on to the results already:

Subsystem/Test Old New

Time to compile linux
kernel (make -j12 on a
6 core CPU)
Running mqueue test user 49m10.744s user 45m26.294s
sys 5m51.924s sys 4m59.894s
total 55m02.668s total 50m26.188s

Running fake test user 45m32.686s user 45m18.552s
sys 5m12.465s sys 4m56.468s
total 50m45.151s total 50m15.020s

% slowdown from mqueue
cache thrashing ~8% ~.5%

Avg time to send/recv (in nanoseconds per message)
when queue empty 305/288 349/318
when queue full (65528 messages)
constant priority 526589/823 362/314
increasing priority 403105/916 495/445
decreasing priority 73420/594 482/409
random priority 280147/920 546/436

Time to fill/drain queue (65528 messages, in seconds)
constant priority 17.37/.12 .13/.12
increasing priority 4.14/.14 .21/.18
decreasing priority 12.93/.13 .21/.18
random priority 8.88/.16 .22/.17

So, I think the results speak for themselves. It's possible this
implementation could be improved by cacheing at least one priority level
in the node tree (that would bring the queue empty performance more in
line with the old implementation), but this works and is *so* much better
than what we had, especially for the common case of a single priority in
use, that further refinements can be in follow on patches.

[akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
[levinsasha928@gmail.com: use correct gfp flags in msg_insert]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cef0184c 31-May-2012 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mqueue: separate mqueue default value from maximum value

Commit b231cca4381e ("message queues: increase range limits") changed
mqueue default value when attr parameter is specified NULL from hard
coded value to fs.mqueue.{msg,msgsize}_max sysctl value.

This made large side effect. When user need to use two mqueue
applications 1) using !NULL attr parameter and it require big message
size and 2) using NULL attr parameter and only need small size message,
app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
memory size even though it doesn't need.

Doug Ledford propsed to switch back it to static hard coded value.
However it also has a compatibility problem. Some applications might
started depend on the default value is tunable.

The solution is to separate default value from maximum value.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Joe Korty <joe.korty@ccur.com>
Cc: Amerigo Wang <amwang@redhat.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fd1f87d2 31-May-2012 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mqueue: don't use kmalloc with KMALLOC_MAX_SIZE

KMALLOC_MAX_SIZE is not a good threshold. It is extremely high and
problematic. Unfortunately, some silly drivers depend on this and we
can't change it. But any new code needn't use such extreme ugly high
order allocations. It brings us awful fragmentation issues and system
slowdown.

Signed-off-by: KOSAKI Motohiro <mkosaki@jp.fujitsu.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Joe Korty <joe.korty@ccur.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5b5c4d1a 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: update maximums for the mqueue subsystem

Commit b231cca4381e ("message queues: increase range limits") changed the
maximum size of a message in a message queue from INT_MAX to 8192*128.
Unfortunately, we had customers that relied on a size much larger than
8192*128 on their production systems. After reviewing POSIX, we found
that it is silent on the maximum message size. We did find a couple other
areas in which it was not silent. Fix up the mqueue maximums so that the
customer's system can continue to work, and document both the POSIX and
real world requirements in ipc_namespace.h so that we don't have this
issue crop back up.

Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower
on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
intentionally in place to limit the msg queue depth to one that was small
enough to kmalloc an array of pointers (hence why we divided 128k by
sizeof(long)). If we wish to meet POSIX requirements, we have no choice
but to change our allocation to a vmalloc instead (at least for the large
queue size case). With that, it's possible to increase our allowed
maximum to the POSIX requirements (or more if we choose).

[sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

02967ea0 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: enforce hard limits

In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
In preparation for making more reasonable hard limits, start enforcing
them even on CAP_SYS_RESOURCE.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

858ee378 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: switch back to using non-max values on create

Commit b231cca4381e ("message queues: increase range limits") changed
how we create a queue that does not include an attr struct passed to
open so that it creates the queue with whatever the maximum values are.
However, if the admin has set the maximums to allow flexibility in
creating a queue (aka, both a large size and large queue are allowed,
but combined they create a queue too large for the RLIMIT_MSGQUEUE of
the user), then attempts to create a queue without an attr struct will
fail. Switch back to using acceptable defaults regardless of what the
maximums are.

Note: so far, we only know of a few applications that rely on this
behavior (specifically, set the maximums in /proc, then run the
application which calls mq_open() without passing in an attr struct, and
the application expects the newly created message queue to have the
maximum sizes that were set in /proc used on the mq_open() call, and all
of those applications that we know of are actually part of regression
test suites that were coded to do something like this:

for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
echo $size > /proc/sys/fs/mqueue/msgsize_max
mq_open || echo "Error opening mq with size $size"
done

These test suites that depend on any behavior like this are broken. The
concept that programs should rely upon the system wide maximum in order
to get their desired results instead of simply using a attr struct to
specify what they want is fundamentally unfriendly programming practice
for any multi-tasking OS.

Fixing this will break those few apps that we know of (and those app
authors recognize the brokenness of their code and the need to fix it).
However, the following patch "mqueue: separate mqueue default value"
allows a workaround in the form of new knobs for the default msg queue
creation parameters for any software out there that we don't already
know about that might rely on this behavior at the moment.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

93e6f119 31-May-2012 Doug Ledford <dledford@redhat.com>

ipc/mqueue: cleanup definition names and locations

Since commit b231cca4381e ("message queues: increase range limits") on
Oct 18, 2008, calls to mq_open() that did not pass in an attribute
struct and expected to get default values for the size of the queue and
the max message size now get the system wide maximums instead of
hardwired defaults like they used to get.

This was uncovered when one of the earlier patches in this patch set
increased the default system wide maximums at the same time it increased
the hard ceiling on the system wide maximums (a customer specifically
needed the hard ceiling brought back up, the new ceiling that commit
b231cca4381e introduced was too low for their production systems). By
increasing the default maximums and not realising they were tied to any
attempt to create a message queue without an attribute struct, I had
inadvertently made it such that all message queue creation attempts
without an attribute struct were failing because the new default
maximums would create a queue that exceeded the default rlimit for
message queue bytes.

As a result, the system wide defaults were brought back down to their
previous levels, and the system wide ceilings on the maximums were
raised to meet the customer's needs. However, the fact that the no
attribute struct behavior of mq_open() could be broken by changing the
system wide maximums for message queues was seen as fundamentally broken
itself. So we hardwired the no attribute case back like it used to be.
But, then we realized that on the very off chance that some piece of
software in the wild depended on that behavior, we could work around
that issue by adding two new knobs to /proc that allowed setting the
defaults for message queues created without an attr struct separately
from the system wide maximums.

What is not an option IMO is to leave the current behavior in place. No
piece of software should ever rely on setting the system wide maximums
in order to get a desired message queue. Such a reliance would be so
fundamentally multitasking OS unfriendly as to not really be tolerable.
Fortunately, we don't know of any software in the wild that uses this
except for a regression test program that caught the issue in the first
place. If there is though, we have made accommodations with the two new
/proc knobs (and that's all the accommodations such fundamentally broken
software can be allowed)..

This patch:

The various defines for minimums and maximums of the sysctl controllable
mqueue values are scattered amongst different files and named
inconsistently. Move them all into ipc_namespace.h and make them have
consistent names. Additionally, make the number of queues per namespace
also have a minimum and maximum and use the same sysctl function as the
other two settable variables.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

90324cc1 28-May-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally


dbd5768f 03-May-2012 Jan Kara <jack@suse.cz>

vfs: Rename end_writeback() to clear_inode()

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>

76b6db01 14-Mar-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgid

These function are no longer needed replace them with their more useful equivalents.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

6f9ac6d9 16-Nov-2011 Eric W. Biederman <ebiederm@xmission.com>

mqueue: Explicitly capture the user namespace to send the notification to.

Stop relying on user->user_ns which is going away and instead capture
the user_namespace of the process we are supposed to notify.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

c4a4d603 17-Nov-2011 Eric W. Biederman <ebiederm@xmission.com>

userns: Use cred->user_ns instead of cred->user->user_ns

Optimize performance and prepare for the removal of the user_ns reference
from user_struct. Remove the slow long walk through cred->user->user_ns and
instead go straight to cred->user_ns.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

50483c32 29-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile (really asm-generic) update from Chris Metcalf:
"These are a couple of asm-generic changes that apply to tile."

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
compat: use sys_sendfile64() implementation for sendfile syscall
[PATCH v3] ipc: provide generic compat versions of IPC syscalls


95211279 22-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (Andrew's patch-bomb)

Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"

* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...


40716e29 21-Mar-2012 Steven Truelove <steven.truelove@utoronto.ca>

hugetlbfs: fix alignment of huge page requests

When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
PAGE_SIZE, but this is not sufficient.

Modify hugetlb_file_setup() to align requests to the huge page size, and
to accept an address argument so that all alignment checks can be
performed in hugetlb_file_setup(), rather than in its callers. Change
newseg() and mmap_pgoff() to match the new prototype and eliminate a now
redundant alignment check.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e2a0883e 21-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...


48fde701 08-Jan-2012 Al Viro <viro@zeniv.linux.org.uk>

switch open-coded instances of d_make_root() to new helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

48b25c43 15-Mar-2012 Chris Metcalf <cmetcalf@tilera.com>

[PATCH v3] ipc: provide generic compat versions of IPC syscalls

When using the "compat" APIs, architectures will generally want to
be able to make direct syscalls to msgsnd(), shmctl(), etc., and
in the kernel we would want them to be handled directly by
compat_sys_xxx() functions, as is true for other compat syscalls.

However, for historical reasons, several of the existing compat IPC
syscalls do not do this. semctl() expects a pointer to the fourth
argument, instead of the fourth argument itself. msgsnd(), msgrcv()
and shmat() expect arguments in different order.

This change adds an ARCH_WANT_OLD_COMPAT_IPC config option that can be
set to preserve this behavior for ports that use it (x86, sparc, powerpc,
s390, and mips). No actual semantics are changed for those architectures,
and there is only a minimal amount of code refactoring in ipc/compat.c.

Newer architectures like tile (and perhaps future architectures such
as arm64 and unicore64) should not select this option, and thus can
avoid having any IPC-specific code at all in their architecture-specific
compat layer. In the same vein, if this option is not selected, IPC_64
mode is assumed, since that's what the <asm-generic> headers expect.

The workaround code in "tile" for msgsnd() and msgrcv() is removed
with this change; it also fixes the bug that shmat() and semctl() were
not being properly handled.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>

40401530 12-Feb-2012 Al Viro <viro@ftp.linux.org.uk>

security: trim security.h

Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>

24513264 20-Jan-2012 Hugh Dickins <hughd@google.com>

SHM_UNLOCK: fix Unevictable pages stranded after swap

Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.

The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages(). The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.

Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().

But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.

Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.

Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

85046579 20-Jan-2012 Hugh Dickins <hughd@google.com>

SHM_UNLOCK: fix long unpreemptible section

scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked. It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here. A cond_resched() every
PAGEVEC_SIZE pages would be good.

However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.

So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling. Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.

Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().

Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.

The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.

The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.

Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2a4e64b8 20-Jan-2012 Davidlohr Bueso <dave@gnu.org>

ipc/mqueue: simplify reading msgqueue limit

Because the current task is being used to get the limit, we can simply
use rlimit() instead of task_rlimit().

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6b550f94 10-Jan-2012 Serge E. Hallyn <serge@hallyn.com>

user namespace: make signal.c respect user namespaces

ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)

__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)

do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace

ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.

Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)

Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.

A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.

Changelog:
Nov 18: previous patch missed a leading '&'

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo

There was a double lock typo introduced in b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

df0a4283 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch mq_open() to umode_t

1b9d5ff7 24-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

mqueue: propagate umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4acdaf27 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->create() to umode_t

vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6b520e05 12-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6f686574 08-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

... and the same kind of leak for mqueue

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e57940d7 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: remove private structures from public header file

include/linux/sem.h contains several structures that are only used within
ipc/sem.c.

The patch moves them into ipc/sem.c - there is no need to expose the
structures to the whole kernel.

No functional changes, only whitespace cleanups and 80-char per line
fixes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0b0577f6 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: handle spurious wakeups

semtimedop() does not handle spurious wakeups, it returns -EINTR to user
space. Most other schedule() users would just loop and not return to user
space. The patch adds such a loop to semtimedop()

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3c24783b 02-Nov-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)

sys_semtimedop() may return -EIDRM although the semaphore operation
completed successfully:

thread 1: thread 2:
semtimedop(), sleeps
semop():
* acquires sem_lock()
semtimedop() woken up due to timeout
sem_lock() loops
* notices that thread 2 could be completed.
* performs the operations that thread 2 is sleeping on.
* marks the semaphore operation as IN_WAKEUP
* drops sem_lock(), does wakeup, sets return code to 0
* thread delayed due to interrupt, whatever
* returns to user space
* thread still delayed
semctl(IPC_RMID)
* acquires sem_lock()
* ipc_rmid(), ipcp->deleted=1
* drops sem_lock()
* thread finally continues - but seem_lock()
now fails due to ipcp->deleted == 1
* returns -EIDRM instead of 0

The fix is trivial: Always use the return code in queue.status.

In real world, the race probably doesn't matter:
If the semaphore array is destroyed, the app is probably not interested
if the last operation succeeded or was already cancelled.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32ea845d 31-Oct-2011 Wanlong Gao <gaowanlong@cn.fujitsu.com>

ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()

Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(),
although it is harmless for the syscall mq_timed* now. It was introduced
by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers").

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

140d0b21 04-Aug-2011 Linus Torvalds <torvalds@linux-foundation.org>

Do 'shm_init_ns()' in an early pure_initcall

This isn't really critical any more, since other patches (commit
298507d4d2cf: "shm: optimize exit_shm()") have caused us to not actually
need to touch the rw_mutex unless there are actual shm segments
associated with the namespace, but we really should do tne shm_init_ns()
earlier than we do now.

This, together with commit 288d5abec831 ("Boot up with usermodehelper
disabled") will mean that we really do initialize the initial ipc
namespace data structure before we run any tasks.

Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

298507d4 03-Aug-2011 Vasiliy Kulikov <segoon@openwall.com>

shm: optimize exit_shm()

We may optimistically check .in_use == 0 without holding the rw_mutex:
it's the common case, and if it's zero, there certainly won't be any
segments associated with us.

After taking the lock, the idr_for_each() will do the right thing, so we
could now drop the re-check inside the lock without any real cost. But
it won't hurt.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

33a30ed4 03-Aug-2011 Vasiliy Kulikov <segoon@openwall.com>

shm: fix wrong tests

Commit 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting")
introduced a copy-paste bug. Due to the bug cycle optimizations were
disabled.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4c677e2e 28-Jul-2011 Vasiliy Kulikov <segoon@openwall.com>

shm: optimize locking and ipc_namespace getting

shm_lock() does a lookup of shm segment in shm_ids(ns).ipcs_idr, which
is redundant as we already know shmid_kernel address. An actual lock is
also not required for reads until we really want to destroy the segment.

exit_shm() and shm_destroy_orphaned() may avoid the loop by checking
whether there is at least one segment in current ipc_namespace.

The check of nsproxy and ipc_ns against NULL is redundant as exit_shm()
is called from do_exit() before the call to exit_notify(), so the
dereferencing current->nsproxy->ipc_ns is guaranteed to be safe.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5774ed01 28-Jul-2011 Vasiliy Kulikov <segoon@openwall.com>

shm: handle separate PID namespaces case

shm_try_destroy_orphaned() and shm_try_destroy_current() didn't handle
the case of separate PID namespaces, but a single IPC namespace. If
there are tasks with the same PID values using the same shmem object,
the wrong destroy decision could be reached.

On shm segment creation store the pointer to the creator task in
shmid_kernel->shm_creator field and zero it on task exit. Then
use the ->shm_creator insread of shm_cprid in both functions. As
shmid_kernel object is already locked at this stage, no additional
locking is needed.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b34a6b1d 26-Jul-2011 Vasiliy Kulikov <segoon@openwall.com>

ipc: introduce shm_rmid_forced sysctl

Add support for the shm_rmid_forced sysctl. If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory. Use shm_rmid_forced=0 by default for compatability
reasons.

The feature was previously impemented in -ow as a configure option.

[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d40dcdb0 26-Jul-2011 Jiri Slaby <jirislaby@kernel.org>

ipc/mqueue.c: fix mq_open() return value

We return ENOMEM from mqueue_get_inode even when we have enough memory.
Namely in case the system rlimit of mqueue was reached. This error
propagates to mq_queue and user sees the error unexpectedly. So fix
this up to properly return EMFILE as described in the manpage:

EMFILE The process already has the maximum number of files and
message queues open.

instead of:

ENOMEM Insufficient memory.

With the previous patch we just switch to ERR_PTR/PTR_ERR/IS_ERR error
handling here.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

04715206 26-Jul-2011 Jiri Slaby <jirislaby@kernel.org>

ipc/mqueue.c: refactor failure handling

If new_inode fails to allocate an inode we need only to return with
NULL. But now we test the opposite and have all the work in a nested
block. So do the opposite to save one indentation level (and remove
unnecessary line breaks).

This is only a preparation/cleanup for the next patch where we fix up
return values from mqueue_get_inode.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d694ad62 25-Jul-2011 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMID

If a semaphore array is removed and in parallel a sleeping task is woken
up (signal or timeout, does not matter), then the woken up task does not
wait until wake_up_sem_queue_do() is completed. This will cause crashes,
because wake_up_sem_queue_do() will read from a stale pointer.

The fix is simple: Regardless of anything, always call get_queue_result().
This function waits until wake_up_sem_queue_do() has finished it's task.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=27142

Reported-by: Yuriy Yevtukhov <yuriy@ucoz.com>
Reported-by: Harald Laabs <kernel@dasr.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org> [2.6.35+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bbd9d6f7 22-Jul-2011 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...

Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.


02c24a82 16-Jul-2011 Josef Bacik <josef@redhat.com>

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d4ee9aa3 17-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com>

ipc,rcu: Convert call_rcu(ipc_immediate_free) to kfree_rcu()

The rcu callback ipc_immediate_free() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(ipc_immediate_free).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

693a8b6e 17-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com>

ipc,rcu: Convert call_rcu(free_un) to kfree_rcu()

The rcu callback free_un() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_un).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

ca16d140 26-May-2011 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm: don't access vm_flags as 'int'

The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
[ Changed to use a typedef - we'll extend it to cover more cases
later, since there has been discussion about making it a 64-bit
type.. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a00eaf11 07-Mar-2010 Eric W. Biederman <ebiederm@xmission.com>

ns proc: Add support for the ipc namespace

Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>

6213cfe8 26-Mar-2011 Randy Dunlap <randy.dunlap@oracle.com>

ipc: fix util.c kernel-doc warnings

Fix ipc/util.c kernel-doc warnings:

Warning(ipc/util.c:336): No description found for parameter 'ns'
Warning(ipc/util.c:620): No description found for parameter 'ns'
Warning(ipc/util.c:790): No description found for parameter 'ns'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

be4d250a 25-Mar-2011 Xiaotian Feng <dfeng@redhat.com>

ipcns: fix use after free in free_ipc_ns()

commit b515498 ("userns: add a user namespace owner of ipc ns") added a
user namespace owner of ipc ns, but it also introduced a use after free in
free_ipc_ns().

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Acked-by: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b0e77598 23-Mar-2011 Serge E. Hallyn <serge@hallyn.com>

userns: user namespaces: convert several capable() calls

CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid are to uids in own namespace, so again checks can be against
current_user_ns().

Changelog:
Jan 11: Use task_ns_capable() in place of sched_capable().
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Clarify (hopefully) some logic in futex and sched.c
Feb 15: use ns_capable for ipc, not nsown_capable
Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
Feb 23: pass ns down rather than taking it from current

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b515498f 23-Mar-2011 Serge E. Hallyn <serge@hallyn.com>

userns: add a user namespace owner of ipc ns

Changelog:
Feb 15: Don't set new ipc->user_ns if we didn't create a new
ipc_ns.
Feb 23: Move extern declaration to ipc_namespace.h, and group
fwd declarations at top.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fa0d7e3d 06-Jan-2011 Nicholas Piggin <npiggin@gmail.com>

fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

3af54c9b 30-Oct-2010 Vasiliy Kulikov <segooon@gmail.com>

ipc: shm: fix information leak to userland

The shmid_ds structure is copied to userland with shm_unused{,2,3}
fields unitialized. It leads to leaking of contents of kernel stack
memory.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ceefda69 26-Jul-2010 Al Viro <viro@zeniv.linux.org.uk>

switch get_sb_ns() users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

03145beb 27-Oct-2010 Dan Rosenberg <drosenberg@vsecurity.com>

ipc: initialize structure memory to zero for compat functions

This takes care of leaking uninitialized kernel stack memory to
userspace from non-zeroed fields in structs in compat ipc functions.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b7952180 27-Oct-2010 Helge Deller <deller@gmx.de>

ipc/shm.c: add RSS and swap size information to /proc/sysvipc/shm

The kernel currently provides no functionality to analyze the RSS and swap
space usage of each individual sysvipc shared memory segment.

This patch adds this info for each existing shm segment by extending the
output of /proc/sysvipc/shm by two columns for RSS and swap.

Since shmctl(SHM_INFO) already provides a similiar calculation (it
currently sums up all RSS/swap info for all segments), I did split out a
static function which is now used by the /proc/sysvipc/shm output and
shmctl(SHM_INFO).

SAP products (esp. the SAP Netweaver ABAP Kernel) uses lots of big shared
memory segments (we often have Linux systems with >= 16GB shm usage).
Sometimes we get customer reports about "slow" system responses and while
looking into their configurations we often find massive swapping activity
on the system. With this patch it's now easy to see from the command line
if and which shm segments gets swapped out (and how much) and can more
easily give recommendations for system tuning. Without the patch it's
currently not possible to do such shm analysis at all.

Also...

Add some spaces in front of the "size" field for 64bit kernels to get the
columns correct if you cat the contents of the file. In
sysvipc_shm_proc_show() the kernel prints the size value in "SPEC_SIZE"
format, which is defined like this:

#if BITS_PER_LONG <= 32
#define SIZE_SPEC "%10lu"
#else
#define SIZE_SPEC "%21lu"
#endif

So, if the header is not adjusted, the columns are not correctly aligned.
I actually tested this on 32- and 64-bit and it seems correct now.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

85fe4025 23-Oct-2010 Christoph Hellwig <hch@lst.de>

fs: do not assign default i_ino in new_inode

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7de9c6ee 23-Oct-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: ihold()

Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

092e0e7e 22-Oct-2010 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl

* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek


6038f373 15-Aug-2010 Arnd Bergmann <arnd@arndb.de>

llseek: automatically add .llseek fop

All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>

982f7c2b 30-Sep-2010 Dan Rosenberg <drosenberg@vsecurity.com>

sys_semctl: fix kernel stack leakage

The semctl syscall has several code paths that lead to the leakage of
uninitialized kernel stack memory (namely the IPC_INFO, SEM_INFO,
IPC_STAT, and SEM_STAT commands) during the use of the older, obsolete
version of the semid_ds struct.

The copy_semid_to_user() function declares a semid_ds struct on the stack
and copies it back to the user without initializing or zeroing the
"sem_base", "sem_pending", "sem_pending_last", and "undo" pointers,
allowing the leakage of 16 bytes of kernel stack memory.

The code is still reachable on 32-bit systems - when calling semctl()
newer glibc's automatically OR the IPC command with the IPC_64 flag, but
invoking the syscall directly allows users to use the older versions of
the struct.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d8af64c 05-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

switch mqueue to ->evict_inode()

... and since the inodes are never hashed, we can use default ->drop_inode()
just fine.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c61284e9 20-Jul-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: bugfix for semop() not reporting successful operation

The last change to improve the scalability moved the actual wake-up out of
the section that is protected by spin_lock(sma->sem_perm.lock).

This means that IN_WAKEUP can be in queue.status even when the spinlock is
acquired by the current task. Thus the same loop that is performed when
queue.status is read without the spinlock acquired must be performed when
the spinlock is acquired.

Thanks to kamezawa.hiroyu@jp.fujitsu.com for noticing lack of the memory
barrier.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255

[akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Luca Tettamanti <kronos.it@gmail.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
Reported-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0abbb609 28-May-2010 Al Viro <viro@zeniv.linux.org.uk>

mqueue doesn't need make_bad_inode()

It never hashes them anyway and does final iput() immediately
afterwards. With ->drop_inode() being generic_delete_inode()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7ea80859 26-May-2010 Christoph Hellwig <hch@lst.de>

drop unused dentry argument to ->fsync

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4de85cd6 26-May-2010 Julia Lawall <julia@diku.dk>

ipc/sem.c: use ERR_CAST

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c5cf6359 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: update description of the implementation

ipc/sem.c begins with a 15 year old description about bugs in the initial
implementation in Linux-1.0. The patch replaces that with a top level
description of the current code.

A TODO could be derived from this text:

The opengroup man page for semop() does not mandate FIFO. Thus there is
no need for a semaphore array list of pending operations.

If

- this list is removed
- the per-semaphore array spinlock is removed (possible if there is no
list to protect)
- sem_otime is moved into the semaphores and calculated on demand during
semctl()

then the array would be read-mostly - which would significantly improve
scaling for applications that use semaphore arrays with lots of entries.

The price would be expensive semctl() calls:

for(i=0;i<sma->sem_nsems;i++) spin_lock(sma->sem_lock);
<do stuff>
for(i=0;i<sma->sem_nsems;i++) spin_unlock(sma->sem_lock);

I'm not sure if the complexity is worth the effort, thus here is the
documentation of the current behavior first.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0a2b9d4c 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: move wake_up_process out of the spinlock section

The wake-up part of semtimedop() consists out of two steps:

- the right tasks must be identified.
- they must be woken up.

Right now, both steps run while the array spinlock is held. This patch
reorders the code and moves the actual wake_up_process() behind the point
where the spinlock is dropped.

The code also moves setting sem->sem_otime to one place: It does not make
sense to set the last modify time multiple times.

[akpm@linux-foundation.org: repair kerneldoc]
[akpm@linux-foundation.org: fix uninitialised retval]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fd5db422 26-May-2010 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize update_queue() for bulk wakeup calls

The following series of patches tries to fix the spinlock contention
reported by Chris Mason - his benchmark exposes problems of the current
code:

- In the worst case, the algorithm used by update_queue() is O(N^2).
Bulk wake-up calls can enter this worst case. The patch series fix
that.

Note that the benchmark app doesn't expose the problem, it just should
be fixed: Real world apps might do the wake-ups in another order than
perfect FIFO.

- The part of the code that runs within the semaphore array spinlock is
significantly larger than necessary.

The patch series fixes that. This change is responsible for the main
improvement.

- The cacheline with the spinlock is also used for a variable that is
read in the hot path (sem_base) and for a variable that is unnecessarily
written to multiple times (sem_otime). The last step of the series
cacheline-aligns the spinlock.

This patch:

The SysV semaphore code allows to perform multiple operations on all
semaphores in the array as atomic operations. After a modification,
update_queue() checks which of the waiting tasks can complete.

The algorithm that is used to identify the tasks is O(N^2) in the worst
case. For some cases, it is simple to avoid the O(N^2).

The patch adds a detection logic for some cases, especially for the case
of an array where all sleeping tasks are single sembuf operations and a
multi-sembuf operation is used to wake up multiple tasks.

A big database application uses that approach.

The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of
the patch breaks that.

[akpm@linux-foundation.org: make do_smart_update() static]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4be929be 24-May-2010 Alexey Dobriyan <adobriyan@gmail.com>

kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

164d44fd 19-May-2010 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: Add clocksource_register_hz/khz interface
posix-cpu-timers: Optimize run_posix_cpu_timers()
time: Remove xtime_cache
mqueue: Convert message queue timeout to use hrtimers
hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME
timers: Introduce the concept of timer slack for legacy timers
ntp: Remove tickadj
ntp: Make time_adjust static
time: Add xtime, wall_to_monotonic to feature-removal-schedule
timer: Try to survive timer callback preempt_count leak
timer: Split out timer function call
timer: Print function name for timer callbacks modifying preemption count
time: Clean up warp_clock()
cpu-timers: Avoid iterating over all threads in fastpath_timer_check()
cpu-timers: Change SIGEV_NONE timer implementation
cpu-timers: Return correct previous timer reload value
cpu-timers: Cleanup arm_timer()
cpu-timers: Simplify RLIMIT_CPU handling


a3ed2a15 11-May-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix kernel BUG caused by double free() on mq_open()

In case of aborting because we reach the maximum amount of memory which
can be allocated to message queues per user (RLIMIT_MSGQUEUE), we would
try to free the message area twice when bailing out: first by the error
handling code itself, and then later when cleaning up the inode through
delete_inode().

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbb6be6d 10-May-2010 Thomas Gleixner <tglx@linutronix.de>

Merge branch 'linus' into timers/core

Reason: Further posix_cpu_timer patches depend on mainline changes

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


9ca7d8e6 02-Apr-2010 Carsten Emde <C.Emde@osadl.org>

mqueue: Convert message queue timeout to use hrtimers

The message queue functions mq_timedsend() and mq_timedreceive()
have not yet been converted to use the hrtimer interface.

This patch replaces the call to schedule_timeout() by a call to
schedule_hrtimeout() and transforms the expiration time from
timespec to ktime as required.

[ tglx: Fixed whitespace wreckage ]

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Tested-by: Pradyumna Sampath <pradysam@gmail.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100402204331.715783034@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

45575f5a 22-Mar-2010 Anton Blanchard <anton@samba.org>

ppc64 sys_ipc breakage in 2.6.34-rc2

I chased down a fail on ppc64 on 2.6.34-rc2 where an application that
uses shared memory was getting a SEGV.

Commit baed7fc9b580bd3fb8252ff1d9b36eaf1f86b670 ("Add generic sys_ipc
wrapper") changed the second argument from an unsigned long to an int.
When we call shmget the system call wrappers for sys_ipc will sign
extend second (ie the size) which truncates it. It took a while to
track down because the call succeeds and strace shows the untruncated
size :)

The patch below changes second from an int to an unsigned long which
fixes shmget on ppc64 (and I assume s390, sparc64 and mips64).

Signed-off-by: Anton Blanchard <anton@samba.org>
--

I assume the function prototypes for the other IPC methods would cause us
to sign or zero extend second where appropriate (avoiding any security
issues). Come to think of it, the syscall wrappers for each method should do
that for us as well.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f1eb1332 10-Mar-2010 Jiri Slaby <jirislaby@kernel.org>

ipc: use rlimit helpers

Make sure compiler won't do weird things with limits. E.g. fetching them
twice may return 2 different values after writable limits are implemented.

I.e. either use rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

baed7fc9 10-Mar-2010 Christoph Hellwig <hch@lst.de>

Add generic sys_ipc wrapper

Add a generic implementation of the ipc demultiplexer syscall. Except for
s390 and sparc64 all implementations of the sys_ipc are nearly identical.

There are slight differences in the types of the parameters, where mips
and powerpc as the only 64-bit architectures with sys_ipc use unsigned
long for the "third" argument as it gets casted to a pointer later, while
it traditionally is an "int" like most other paramters. frv goes even
further and uses unsigned long for all parameters execept for "ptr" which
is a pointer type everywhere. The change from int to unsigned long for
"third" and back to "int" for the others on frv should be fine due to the
in-register calling conventions for syscalls (we already had a similar
issue with the generic sys_ptrace), but I'd prefer to have the arch
maintainers looks over this in details.

Except for that h8300, m68k and m68knommu lack an impplementation of the
semtimedop sub call which this patch adds, and various architectures have
gets used - at least on i386 it seems superflous as the compat code on
x86-64 and ia64 doesn't even bother to implement it.

[akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2329e392 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix typo "failues" -> "failures"

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8d8ffefa 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: only set error codes if they are really necessary

... postponing assignments until they're needed. Doesn't change code size.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

04db0dde 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: simplify do_open() error handling

It reduces code size:
text data bss dec hex filename
9925 72 16 10013 271d ipc/mqueue-BEFORE.o
9885 72 16 9973 26f5 ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8834cf79 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: apply mathematics distributivity on mq_bytes calculation

Code size reduction:
text data bss dec hex filename
9941 72 16 10029 272d ipc/mqueue-BEFORE.o
9925 72 16 10013 271d ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c8308b1c 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: remove unneeded info->messages initialization

... and abort earlier if we couldn't allocate the message pointers array,
avoiding the u->mq_bytes accounting logic.

It reduces code size:
text data bss dec hex filename
9949 72 16 10037 2735 ipc/mqueue-BEFORE.o
9941 72 16 10029 272d ipc/mqueue-AFTER.o

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4294a8ee 23-Feb-2010 André Goddard Rosa <andre.goddard@gmail.com>

mqueue: fix mq_open() file descriptor leak on user-space processes

We leak fd on lookup_one_len() failure

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ed5e5894 15-Jan-2010 David Howells <dhowells@redhat.com>

nommu: fix SYSV SHM for NOMMU

Commit c4caa778157dbbf04116f0ac2111e389b5cd7a29 ("file
->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()")
broke SYSV SHM for NOMMU by taking away the pointer to
shm_get_unmapped_area() from shm_file_operations.

Put it back conditionally on CONFIG_MMU=n.

file->f_ops->get_unmapped_area() is used to find out the base address for a
mapping of a mappable chardev device or mappable memory-based file (such as a
ramfs file). It needs to be called prior to file->f_ops->mmap() being called.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bac5e54c 16-Dec-2009 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
direct I/O fallback sync simplification
ocfs: stop using do_sync_mapping_range
cleanup blockdev_direct_IO locking
make generic_acl slightly more generic
sanitize xattr handler prototypes
libfs: move EXPORT_SYMBOL for d_alloc_name
vfs: force reval of target when following LAST_BIND symlinks (try #7)
ima: limit imbalance msg
Untangling ima mess, part 3: kill dead code in ima
Untangling ima mess, part 2: deal with counters
Untangling ima mess, part 1: alloc_file()
O_TRUNC open shouldn't fail after file truncation
ima: call ima_inode_free ima_inode_free
IMA: clean up the IMA counts updating code
ima: only insert at inode creation time
ima: valid return code from ima_inode_alloc
fs: move get_empty_filp() deffinition to internal.h
Sanitize exec_permission_lite()
Kill cached_lookup() and real_lookup()
Kill path_lookup_open()
...

Trivial conflicts in fs/direct-io.c


b65a9cfc 16-Dec-2009 Al Viro <viro@zeniv.linux.org.uk>

Untangling ima mess, part 2: deal with counters

* do ima_get_count() in __dentry_open()
* stop doing that in followups
* move ima_path_check() to right after nameidata_to_filp()
* don't bump counters on it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0552f879 16-Dec-2009 Al Viro <viro@zeniv.linux.org.uk>

Untangling ima mess, part 1: alloc_file()

There are 2 groups of alloc_file() callers:
* ones that are followed by ima_counts_get
* ones giving non-regular files
So let's pull that ima_counts_get() into alloc_file();
it's a no-op in case of non-regular files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2c48b9c4 08-Aug-2009 Al Viro <viro@zeniv.linux.org.uk>

switch alloc_file() to passing struct path

... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e5cc9c7b 15-Dec-2009 Amerigo Wang <amwang@redhat.com>

ipc: remove unreachable code in sem.c

This line is unreachable, remove it.

[akpm@linux-foundation.org: remove unneeded initialisation of `err']
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d987f8b2 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize single sops when semval is zero

If multiple simple decrements on the same semaphore are pending, then the
current code scans all decrement operations, even if the semaphore value
is already 0.

The patch optimizes that: if the semaphore value is 0, then there is no
need to scan the q->alter entries.

Note that this is a common case: It happens if 100 decrements by one are
pending and now an increment by one increases the semaphore value from 0
to 1. Without this patch, all 100 entries are scanned. With the patch,
only one entry is scanned, then woken up. Then the new rule triggers and
the scanning is aborted, without looking at the remaining 99 tasks.

With this patch, single sop increment/decrement by 1 are now O(1).
(same as with Nick's patch)

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

636c6be8 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize single semop operations

sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.

The patch optimizes single semaphore operation calls that affect only one
semaphore: It's not necessary to scan all pending operations, it is
sufficient to scan the per-semaphore list.

The idea is from Nick Piggin version of an ipc sem improvement, the
implementation is different: The code tries to keep as much common code as
possible.

As the result, the patch is simpler, but optimizes fewer cases.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b97e820f 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: add a per-semaphore pending list

Based on Nick's findings:

sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.

The patch is the first step for optimizing simple, single semaphore
operations: In addition to the global list of all pending operations, a
2nd, per-semaphore list with the simple operations is added.

Note: this patch does not make sense by itself, the new list is used
nowhere.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b6e90822 15-Dec-2009 Manfred Spraul <manfred@colorfullife.com>

ipc/sem.c: optimize if semops fail

Reduce the amount of scanning of the list of pending