History log of /linux-master/fs/binfmt_elf.c
Revision Date Author Comments
# f9c0a39d 28-Sep-2023 Kees Cook <keescook@chromium.org>

binfmt_elf: Only report padzero() errors when PROT_WRITE

Errors with padzero() should be caught unless we're expecting a
pathological (non-writable) segment. Report -EFAULT only when PROT_WRITE
is present.

Additionally add some more documentation to padzero(), elf_map(), and
elf_load().

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-5-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# d5ca24f6 28-Sep-2023 Kees Cook <keescook@chromium.org>

binfmt_elf: Use elf_load() for library

While load_elf_library() is a libc5-ism, we can still replace most of
its contents with elf_load() as well, further simplifying the code.

Some historical context:
- libc4 was a.out and used uselib (a.out support has been removed)
- libc5 was ELF and used uselib (there may still be users)
- libc6 is ELF and has never used uselib

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# 8b04d326 28-Sep-2023 Kees Cook <keescook@chromium.org>

binfmt_elf: Use elf_load() for interpreter

Handle arbitrary memsz>filesz in interpreter ELF segments, instead of
only supporting it in the last segment (which is expected to be the
BSS).

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
Closes: https://lore.kernel.org/lkml/20221106021657.1145519-1-pedro.falcato@gmail.com/
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# 8ed2ef21 28-Sep-2023 Kees Cook <keescook@chromium.org>

binfmt_elf: elf_bss no longer used by load_elf_binary()

With the BSS handled generically via the new filesz/memsz mismatch
handling logic in elf_load(), elf_bss no longer needs to be tracked.
Drop the variable.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# 585a0186 28-Sep-2023 Eric W. Biederman <ebiederm@xmission.com>

binfmt_elf: Support segments with 0 filesz and misaligned starts

Implement a helper elf_load() that wraps elf_map() and performs all
of the necessary work to ensure that when "memsz > filesz" the bytes
described by "memsz > filesz" are zeroed.

An outstanding issue is if the first segment has filesz 0, and has a
randomized location. But that is the same as today.

In this change I replaced an open coded padzero() that did not clear
all of the way to the end of the page, with padzero() that does.

I also stopped checking the return of padzero() as there is at least
one known case where testing for failure is the wrong thing to do.
It looks like binfmt_elf_fdpic may have the proper set of tests
for when error handling can be safely completed.

I found a couple of commits in the old history
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
that look very interesting in understanding this code.

commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")

Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
> commit 39b56d902bf35241e7cba6cc30b828ed937175ad
> Author: Pavel Machek <pavel@ucw.cz>
> Date: Wed Feb 9 22:40:30 2005 -0800
>
> [PATCH] binfmt_elf: clearing bss may fail
>
> So we discover that Borland's Kylix application builder emits weird elf
> files which describe a non-writeable bss segment.
>
> So remove the clear_user() check at the place where we zero out the bss. I
> don't _think_ there are any security implications here (plus we've never
> checked that clear_user() return value, so whoops if it is a problem).
>
> Signed-off-by: Pavel Machek <pavel@suse.cz>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>

It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for
non-writable segments and otherwise calling clear_user(), aka padzero(),
and checking it's return code is the right thing to do.

I just skipped the error checking as that avoids breaking things.

And notably, it looks like Borland's Kylix died in 2005 so it might be
safe to just consider read-only segments with memsz > filesz an error.

Reported-by: Sebastian Ott <sebott@redhat.com>
Reported-by: Thomas Weißschuh <linux@weissschuh.net>
Closes: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/87sf71f123.fsf@email.froward.int.ebiederm.org
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# 8d7071af 24-Jun-2023 Linus Torvalds <torvalds@linux-foundation.org>

mm: always expand the stack with the mmap write lock held

This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f440fa1a 16-Jun-2023 Liam R. Howlett <Liam.Howlett@oracle.com>

mm: make find_extend_vma() fail if write lock not held

Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.

To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.

Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa88054b 22-Jun-2023 Baruch Siach <baruch@tkos.co.il>

binfmt_elf: fix comment typo s/reset/regset/

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/0b2967c4a4141875c493e835d5a6f8f2d19ae2d6.1687499804.git.baruch@tkos.co.il


# 60592fb6 11-May-2023 Fangrui Song <maskray@google.com>

coredump, vmcore: Set p_align to 4 for PT_NOTE

Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program
header as an array of 4-byte entries or 8-byte entries. Currently, there
are workarounds[1] in place for Linux to treat p_align==0 as 4. However,
it would be more appropriate to set the correct alignment so that tools
do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as
well.

[1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6
[2]: https://reviews.llvm.org/D150022

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com


# 7435721a 07-Mar-2023 Nick Alcock <nick.alcock@oracle.com>

binfmt_elf: remove MODULE_LICENSE in non-modules

Since commit 8b41fc4454e ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 70e79866 28-Feb-2023 Alexey Dobriyan <adobriyan@gmail.com>

ELF: fix all "Elf" typos

ELF is acronym and therefore should be spelled in all caps.

I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like
being written in the first person.

Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 317c8194 22-Nov-2022 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

rseq: Introduce feature size and alignment ELF auxiliary vector entries

Export the rseq feature size supported by the kernel as well as the
required allocation alignment for the rseq per-thread area to user-space
through ELF auxiliary vector entries.

This is part of the extensible rseq ABI.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221122203932.231377-3-mathieu.desnoyers@efficios.com


# 19e183b5 22-Dec-2022 Catalin Marinas <catalin.marinas@arm.com>

elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}

A subsequent fix for arm64 will use this parameter to parse the vma
information from the snapshot created by dump_vma_snapshot() rather than
traversing the vma list without the mmap_lock.

Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file")
Cc: <stable@vger.kernel.org> # 5.18.x
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Seth Jenkins <sethjenkins@google.com>
Suggested-by: Seth Jenkins <sethjenkins@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>


# 38ba2f11 04-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

[elf] get rid of get_note_info_size()

it's trivial now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e92edb85 04-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

[elf] unify regset and non-regset cases

The only real difference is in filling per-thread notes - getting
the values of registers. And this is the only part that is worth
an ifdef - we don't need to duplicate the logics regarding gathering
threads, filling other notes, etc.

It would've been hard to do back when regset-based variant had been
introduced, mostly due to sharing bits and pieces of helpers with
aout coredumps. As the result, too much had been duplicated and
the copies had drifted away since then. Now it can be done cleanly...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e961d370 04-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

[elf][non-regset] use elf_core_copy_task_regs() for dumper as well

elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of
current on all architectures.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bdbadfcc 03-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

[elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument)

Don't bother with pointless macros - we are not sharing it with aout coredumps
anymore. Just convert the underlying functions to the same arguments (nobody
uses regs, actually) and call them elf_core_copy_task_fpregs(). And unexport
the entire bunch, while we are at it.

[added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
warnings about the lack of externs getting added to huge piles for those
files. Pointless, but...]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dc64cc12 14-Nov-2022 Bo Liu <liubo03@inspur.com>

binfmt_elf: replace IS_ERR() with IS_ERR_VALUE()

Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE()
instead.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221115031757.2426-1-liubo03@inspur.com


# ef20c513 19-Oct-2022 Rolf Eike Beer <eb@emlix.com>

binfmt_elf: simplify error handling in load_elf_phdrs()

The err variable was the same like retval, but capped to <= 0. This is the
same as retval as elf_read() never returns positive values.

Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/4137126.7Qn9TF0dmF@mobilepool36.emlix.com


# cfc46ca4 19-Oct-2022 Rolf Eike Beer <eb@emlix.com>

binfmt_elf: fix documented return value for load_elf_phdrs()

This function has never returned anything but a plain NULL.

Fixes: 6a8d38945cf4 ("binfmt_elf: Hoist ELF program header loading to a function")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/2359389.EDbqzprbEW@mobilepool36.emlix.com


# 8f6e3f9e 18-Oct-2022 Kees Cook <keescook@chromium.org>

binfmt: Fix whitespace issues

Fix the annoying whitespace issues that have been following these files
around for years.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20221018071350.never.230-kees@kernel.org


# 4b0e21d6 08-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

[elf][regset] simplify thread list handling in fill_note_info()

fill_note_info() iterates through the list of threads collected in
mm->core_state->dumper, allocating a struct elf_thread_core_info
instance for each and linking those into a list.

We need the entry corresponding to current to be first in the
resulting list, so the logics for list insertion is
if it's for current or list is empty
insert in the head
else
insert after the first element

However, in mm->core_state->dumper the entry for current is guaranteed
to be the first one. Which means that both parts of condition will
be true on the first iteration and neither will be true on all subsequent
ones.

Taking the first iteration out of the loop simplifies things nicely...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 922ef161 04-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

[elf][regset] clean fill_note_info() a bit

*info is already initialized...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9a938eba 07-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

kill coredump_params->regs

it's always task_pt_regs(current)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 594d2a14 24-Oct-2022 Li Zetao <lizetao1@huawei.com>

fs/binfmt_elf: Fix memory leak in load_elf_binary()

There is a memory leak reported by kmemleak:

unreferenced object 0xffff88817104ef80 (size 224):
comm "xfs_admin", pid 47165, jiffies 4298708825 (age 1333.476s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
60 a8 b3 00 81 88 ff ff a8 10 5a 00 81 88 ff ff `.........Z.....
backtrace:
[<ffffffff819171e1>] __alloc_file+0x21/0x250
[<ffffffff81918061>] alloc_empty_file+0x41/0xf0
[<ffffffff81948cda>] path_openat+0xea/0x3d30
[<ffffffff8194ec89>] do_filp_open+0x1b9/0x290
[<ffffffff8192660e>] do_open_execat+0xce/0x5b0
[<ffffffff81926b17>] open_exec+0x27/0x50
[<ffffffff81a69250>] load_elf_binary+0x510/0x3ed0
[<ffffffff81927759>] bprm_execve+0x599/0x1240
[<ffffffff8192a997>] do_execveat_common.isra.0+0x4c7/0x680
[<ffffffff8192b078>] __x64_sys_execve+0x88/0xb0
[<ffffffff83bbf0a5>] do_syscall_64+0x35/0x80

If "interp_elf_ex" fails to allocate memory in load_elf_binary(),
the program will take the "out_free_ph" error handing path,
resulting in "interpreter" file resource is not released.

Fix it by adding an error handing path "out_free_file", which will
release the file resource when "interp_elf_ex" failed to allocate
memory.

Fixes: 0693ffebcfe5 ("fs/binfmt_elf.c: allocate less for static executable")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20221024154421.982230-1-lizetao1@huawei.com


# aeb79237 14-Apr-2022 Andrew Morton <akpm@linux-foundation.org>

revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"

Despite Mike's attempted fix (925346c129da117122), regressions reports
continue:

https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

So revert this patch.

Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 354e923d 14-Apr-2022 Andrew Morton <akpm@linux-foundation.org>

revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"

Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d60fec5f
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").

But regressionss continue to be reported:

https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

This patch reverts the fix, so the original can also be reverted.

Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dd664099 17-Mar-2022 Rick Edgecombe <rick.p.edgecombe@intel.com>

binfmt_elf: Don't write past end of notes for regset gap

In fill_thread_core_info() the ptrace accessible registers are collected
to be written out as notes in a core file. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only allows for
there to be non-zero core_note_type at the end of the regset view. If
there are any gaps in the middle, fill_thread_core_info() will overflow the
note allocation, as it iterates over the size of the view and the
allocation would be smaller than that.

There doesn't appear to be any arch that has gaps such that they exceed
the notes allocation, but the code is brittle and tries to support
something it doesn't. It could be fixed by increasing the allocation size,
but instead just have the note collecting code utilize the array better.
This way the allocation can stay smaller.

Even in the case of no arch's that have gaps in their regset views, this
introduces a change in the resulting indicies of t->notes. It does not
introduce any changes to the core file itself, because any blank notes are
skipped in write_note_info().

In case, the allocation logic between fill_note_info() and
fill_thread_core_info() ever diverges from the usage logic, warn and skip
writing any notes that would overflow the array.

This fix is derrived from an earlier one[0] by Yu-cheng Yu.

[0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com


# 390031c9 08-Mar-2022 Eric W. Biederman <ebiederm@xmission.com>

coredump: Use the vma snapshot in fill_files_note

Matthew Wilcox reported that there is a missing mmap_lock in
file_files_note that could possibly lead to a user after free.

Solve this by using the existing vma snapshot for consistency
and to avoid the need to take the mmap_lock anywhere in the
coredump code except for dump_vma_snapshot.

Update the dump_vma_snapshot to capture vm_pgoff and vm_file
that are neeeded by fill_files_note.

Add free_vma_snapshot to free the captured values of vm_file.

Reported-by: Matthew Wilcox <willy@infradead.org>
Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org
Cc: stable@vger.kernel.org
Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 9ec7d323 30-Jan-2022 Eric W. Biederman <ebiederm@xmission.com>

coredump/elf: Pass coredump_params into fill_note_info

Instead of individually passing cprm->siginfo and cprm->regs
into fill_note_info pass all of struct coredump_params.

This is preparation to allow fill_files_note to use the existing
vma snapshot.

Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 95c5436a 07-Mar-2022 Eric W. Biederman <ebiederm@xmission.com>

coredump: Snapshot the vmas in do_coredump

Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the
individual coredump routines into do_coredump itself. This makes
the code less error prone and easier to maintain.

Make the vma snapshot available to the coredump routines
in struct coredump_params. This makes it easier to
change and update what is captures in the vma snapshot
and will be needed for fixing fill_file_notes.

Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 9e1a3ce0 23-Feb-2022 Kees Cook <keescook@chromium.org>

binfmt_elf: Introduce KUnit test

Adds simple KUnit test for some binfmt_elf internals: specifically a
regression test for the problem fixed by commit 8904d9cd90ee ("ELF:
fix overflow in total mapping size calculation").

$ ./tools/testing/kunit/kunit.py run --arch x86_64 \
--kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf'
...
[19:41:08] ================== binfmt_elf (1 subtest) ==================
[19:41:08] [PASSED] total_mapping_size_test
[19:41:08] =================== [PASSED] binfmt_elf ====================
[19:41:08] ============== compat_binfmt_elf (1 subtest) ===============
[19:41:08] [PASSED] total_mapping_size_test
[19:41:08] ================ [PASSED] compat_binfmt_elf ================
[19:41:08] ============================================================
[19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: David Gow <davidgow@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de>
Cc: kunit-dev@googlegroups.com
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org
v2:
- improve commit log
- fix comment URL (Daniel)
- drop redundant KUnit Kconfig help info (Daniel)
- note in Kconfig help that COMPAT builds add a compat test (David)


# 2b4bfbe0 27-Jan-2022 Akira Kawata <akirakawata1@gmail.com>

fs/binfmt_elf: Refactor load_elf_binary function

I delete load_addr because it is not used anymore. And I rename
load_addr_set to first_pt_load because it is used only to capture the
first iteration of the loop.

Signed-off-by: Akira Kawata <akirakawata1@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-3-akirakawata1@gmail.com


# 0da1d500 27-Jan-2022 Akira Kawata <akirakawata1@gmail.com>

fs/binfmt_elf: Fix AT_PHDR for unusual ELF files

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921

As pointed out in the discussion of buglink, we cannot calculate AT_PHDR
as the sum of load_addr and exec->e_phoff.

: The AT_PHDR of ELF auxiliary vectors should point to the memory address
: of program header. But binfmt_elf.c calculates this address as follows:
:
: NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
:
: which is wrong since e_phoff is the file offset of program header and
: load_addr is the memory base address from PT_LOAD entry.
:
: The ld.so uses AT_PHDR as the memory address of program header. In normal
: case, since the e_phoff is usually 64 and in the first PT_LOAD region, it
: is the correct program header address.
:
: But if the address of program header isn't equal to the first PT_LOAD
: address + e_phoff (e.g. Put the program header in other non-consecutive
: PT_LOAD region), ld.so will try to read program header from wrong address
: then crash or use incorrect program header.

This is because exec->e_phoff
is the offset of PHDRs in the file and the address of PHDRs in the
memory may differ from it. This patch fixes the bug by calculating the
address of program headers from PT_LOADs directly.

Signed-off-by: Akira Kawata <akirakawata1@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-2-akirakawata1@gmail.com


# d65bc29b 13-Feb-2022 Alexey Dobriyan <adobriyan@gmail.com>

binfmt: move more stuff undef CONFIG_COREDUMP

struct linux_binfmt::core_dump and struct min_coredump::min_coredump
are used under CONFIG_COREDUMP only. Shrink those embedded configs
a bit.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/YglbIFyN+OtwVyjW@localhost.localdomain


# 10b19249 03-Oct-2021 Alexey Dobriyan <adobriyan@gmail.com>

ELF: fix overflow in total mapping size calculation

Kernel assumes that ELF program headers are ordered by mapping address,
but doesn't enforce it. It is possible to make mapping size extremely huge
by simply shuffling first and last PT_LOAD segments.

As long as PT_LOAD segments do not overlap, it is silly to require
sorting by v_addr anyway because mmap() doesn't care.

Don't assume PT_LOAD segments are sorted and calculate min and max
addresses correctly.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: "Magnus Groß" <magnus.gross@rwth-aachen.de>
Link: https://lore.kernel.org/all/Yfqm7HbucDjPbES+@fractal.localdomain/
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/lkml/YVmd7D0M6G%2FDcP4O@localhost.localdomain


# 439a8468 28-Feb-2022 Kees Cook <keescook@chromium.org>

binfmt_elf: Avoid total_mapping_size for ET_EXEC

Partially revert commit 5f501d555653 ("binfmt_elf: reintroduce using
MAP_FIXED_NOREPLACE"), which applied the ET_DYN "total_mapping_size"
logic also to ET_EXEC.

At least ia64 has ET_EXEC PT_LOAD segments that are not virtual-address
contiguous (but _are_ file-offset contiguous). This would result in a
giant mapping attempting to cover the entire span, including the virtual
address range hole, and well beyond the size of the ELF file itself,
causing the kernel to refuse to load it. For example:

$ readelf -lW /usr/bin/gcc
...
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz ...
...
LOAD 0x000000 0x4000000000000000 0x4000000000000000 0x00b5a0 0x00b5a0 ...
LOAD 0x00b5a0 0x600000000000b5a0 0x600000000000b5a0 0x0005ac 0x000710 ...
...
^^^^^^^^ ^^^^^^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^

File offset range : 0x000000-0x00bb4c
0x00bb4c bytes

Virtual address range : 0x4000000000000000-0x600000000000bcb0
0x200000000000bcb0 bytes

Remove the total_mapping_size logic for ET_EXEC, which reduces the
ET_EXEC MAP_FIXED_NOREPLACE coverage to only the first PT_LOAD (better
than nothing), and retains it for ET_DYN.

Ironically, this is the reverse of the problem that originally caused
problems with MAP_FIXED_NOREPLACE: overlapping PT_LOAD segments. Future
work could restore full coverage if load_elf_binary() were to perform
mappings in a separate phase from the loading (where it could resolve
both overlaps and holes).

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Reported-by: matoro <matoro_bugzilla_kernel@matoro.tk>
Fixes: 5f501d555653 ("binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE")
Link: https://lore.kernel.org/r/a3edd529-c42d-3b09-135c-7e98a15b150f@leemhuis.info
Tested-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Link: https://lore.kernel.org/lkml/ce8af9c13bcea9230c7689f3c1e0e2cd@matoro.tk
Tested-By: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/lkml/49182d0d-708b-4029-da5f-bc18603440a6@physik.fu-berlin.de
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>


# 925346c1 11-Feb-2022 Mike Rapoport <rppt@kernel.org>

fs/binfmt_elf: fix PT_LOAD p_align values for loaders

Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
__data_start/_end pair" assertion from libgc after update to v5.17-rc1.

Bisection pointed to commit 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD
p_align values for static PIE") that fixed handling of static PIEs, but
made the condition that guards load_bias calculation to exclude loader
binaries.

Restoring the check for presence of interpreter fixes the problem.

Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9630f0d6 19-Jan-2022 H.J. Lu <hjl.tools@gmail.com>

fs/binfmt_elf: use PT_LOAD p_align values for static PIE

Extend commit ce81bb256a22 ("fs/binfmt_elf: use PT_LOAD p_align values
for suitable start address") which fixed PIE binaries built with
-Wl,-z,max-page-size=0x200000, to cover static PIE binaries. This
fixes:

https://bugzilla.kernel.org/show_bug.cgi?id=215275

Tested by verifying static PIE binaries with -Wl,-z,max-page-size=0x200000 loading.

Link: https://lkml.kernel.org/r/20211209174052.370537-1-hjl.tools@gmail.com
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 95af469c 19-Jan-2022 Yafang Shao <laoar.shao@gmail.com>

fs/binfmt_elf: replace open-coded string copy with get_task_comm

It is better to use get_task_comm() instead of the open coded string
copy as we do in other places.

struct elf_prpsinfo is used to dump the task information in userspace
coredump or kernel vmcore. Below is the verification of vmcore,

crash> ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff9d21a940 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffffa09e40f85e80 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffffa09e40f81f80 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffffa09e40f83f00 RU 0.0 0 0 [swapper/3]
> 0 0 4 ffffa09e40f80000 RU 0.0 0 0 [swapper/4]
> 0 0 5 ffffa09e40f89f80 RU 0.0 0 0 [swapper/5]
0 0 6 ffffa09e40f8bf00 RU 0.0 0 0 [swapper/6]
> 0 0 7 ffffa09e40f88000 RU 0.0 0 0 [swapper/7]
> 0 0 8 ffffa09e40f8de80 RU 0.0 0 0 [swapper/8]
> 0 0 9 ffffa09e40f95e80 RU 0.0 0 0 [swapper/9]
> 0 0 10 ffffa09e40f91f80 RU 0.0 0 0 [swapper/10]
> 0 0 11 ffffa09e40f93f00 RU 0.0 0 0 [swapper/11]
> 0 0 12 ffffa09e40f90000 RU 0.0 0 0 [swapper/12]
> 0 0 13 ffffa09e40f9bf00 RU 0.0 0 0 [swapper/13]
> 0 0 14 ffffa09e40f98000 RU 0.0 0 0 [swapper/14]
> 0 0 15 ffffa09e40f9de80 RU 0.0 0 0 [swapper/15]

It works well as expected.

Some comments are added to explain why we use the hard-coded 16.

Link: https://lkml.kernel.org/r/20211120112738.45980-5-laoar.shao@gmail.com
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a43e5e3a 08-Nov-2021 Alexey Dobriyan <adobriyan@gmail.com>

ELF: simplify STACK_ALLOC macro

"A -= B; A" is equivalent to "A -= B".

Link: https://lkml.kernel.org/r/YVmcP256fRMqCwgK@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f501d55 08-Nov-2021 Kees Cook <keescook@chromium.org>

binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE

Commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") reverted back to using MAP_FIXED to map ELF LOAD
segments because it was found that the segments in some binaries overlap
and can cause MAP_FIXED_NOREPLACE to fail.

The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to
prevent the silent clobbering of an existing mapping (e.g. stack) by
the ELF image, which could lead to exploitable conditions. Quoting
commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map"),
which originally introduced the use of MAP_FIXED_NOREPLACE in the
loader:

Both load_elf_interp and load_elf_binary rely on elf_map to map
segments [to a specific] address and they use MAP_FIXED to enforce
that. This is however [a] dangerous thing prone to silent data
corruption which can be even exploitable.
...
Let's take CVE-2017-1000253 as an example ... we could end up mapping
[the executable] over the existing stack ... The [stack layout] issue
has been fixed since then ... So we should be safe and any [similar]
attack should be impractical. On the other hand this is just too
subtle [an] assumption ... it can break quite easily and [be] hard to
spot.
...
Address this [weakness] by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is
an existing mapping clashing with the requested one [instead of
silently] clobbering it.

Then processing ET_DYN binaries the loader already calculates a total
size for the image when the first segment is mapped, maps the entire
image, and then unmaps the remainder before the remaining segments are
then individually mapped.

To avoid the earlier problems (legitimate overlapping LOAD segments
specified in the ELF), apply the same logic to ET_EXEC binaries as well.

For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the
initial total size mapping and then use MAP_FIXED to build the final
(possibly legitimately overlapping) mappings. For ET_DYN w/out INTERP,
continue to map at a system-selected address in the mmap region.

Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org
Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.com
Co-developed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Chen Jingwen <chenjingwen6@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0258b5fd 22-Sep-2021 Eric W. Biederman <ebiederm@xmission.com>

coredump: Limit coredumps to a single thread group

Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.

In the case of vfork this looks like a real world problem. Consider
the following well defined sequence.

if (vfork() == 0) {
execve(...);
_exit(EXIT_FAILURE);
}

If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).

Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.

As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.

Solve this by making a userspace visible change to only kill a single
process/thread group. This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening. With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.

To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.

In zap_threads remove the work to kill anything except for the current
thread group.

v2: Remove core_state from the VM_BUG_ON_MM print to fix
compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>

[1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 9b2f72cc 28-Sep-2021 Chen Jingwen <chenjingwen6@huawei.com>

elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings

In commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.

Unfortunately, this will cause kernel to fail to start with:

1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)

The reason is that the elf interpreter (ld.so) has overlapping segments.

readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000

The reason for this problem is the same as described in commit
ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").

Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.

Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4589ff7c 23-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: remove in-tree usage of MAP_DENYWRITE

At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.

In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.

While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
"Overlayfs did not honor positive i_writecount on realfile for
VM_DENYWRITE mappings." [1]

[1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>


# 42be8b42 21-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()

uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().

For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.

With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.

This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>


# a4eec6a3 28-Jun-2021 David Hildenbrand <david@redhat.com>

binfmt: remove in-tree usage of MAP_EXECUTABLE

Ever since commit e9714acf8c43 ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
essentially completely ignored. Let's remove all usage of MAP_EXECUTABLE.

[akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]

Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f064a59 11-Jun-2021 Peter Zijlstra <peterz@infradead.org>

sched: Change task_struct::state

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org


# d0f1088b 08-Mar-2020 Al Viro <viro@zeniv.linux.org.uk>

coredump: don't bother with do_truncate()

have dump_skip() just remember how much needs to be skipped,
leave actual seeks/writing zeroes to the next dump_emit()
or the end of coredump output, whichever comes first.
And instead of playing with do_truncate() in the end, just
write one NUL at the end of the last gap (if any).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2347961b 28-Jan-2020 Laurent Vivier <laurent@vivier.eu>

binfmt_misc: pass binfmt_misc flags to the interpreter

It can be useful to the interpreter to know which flags are in use.

For instance, knowing if the preserve-argv[0] is in use would
allow to skip the pathname argument.

This patch uses an unused auxiliary vector, AT_FLAGS, to add a
flag to inform interpreter if the preserve-argv[0] is enabled.

Note by Helge Deller:
The real-world user of this patch is qemu-user, which needs to know
if it has to preserve the argv[0]. See Debian bug #970460.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: YunQiang Su <ysu@wavecomp.com>
URL: http://bugs.debian.org/970460
Signed-off-by: Helge Deller <deller@gmx.de>


# f2485a2d 12-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

elf_prstatus: collect the common part (everything before pr_reg) into a struct

Preparations to doing i386 compat elf_prstatus sanely - rather than duplicating
the beginning of compat_elf_prstatus, take these fields into a separate
structure (compat_elf_prstatus_common), so that it could be reused. Due to
the incestous relationship between binfmt_elf.c and compat_binfmt_elf.c we
need the same shape change done to native struct elf_prstatus, gathering the
fields prior to pr_reg into a new structure (struct elf_prstatus_common).

Fortunately, offset of pr_reg is always a multiple of 16 with no padding
right before it, so it's possible to turn all the stuff prior to it into
a single member without disturbing the layout.

[build fix from Geert Uytterhoeven folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8a00dd00 22-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID

On 64bit architectures that support 32bit processes there are
two possible layouts for NT_PRSTATUS note in ELF coredumps.
For one thing, several fields are 64bit for native processes
and 32bit for compat ones (pr_sigpend, etc.). For another,
the register dump is obviously different - the size and number
of registers are not going to be the same for 32bit and 64bit
variants of processor.

Usually that's handled by having two structures - elf_prstatus
for native layout and compat_elf_prstatus for 32bit one.
32bit processes are handled by fs/compat_binfmt_elf.c, which
defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
Then it includes fs/binfmt_elf.c, which makes all references to
struct elf_prstatus to be textually replaced with struct
compat_elf_prstatus. Ugly and somewhat brittle, but it works.

However, amd64 is worse - there are _three_ possible layouts.
One for native 64bit processes, another for i386 (32bit) processes
and yet another for x32 (32bit address space with full 64bit
registers).

Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
with usual compat_binfmt_elf.c trickery. However, the layouts
for i386 and x32 are not identical - they have the common beginning,
but the register dump part (pr_reg) is bigger on x32. Worse, pr_reg
is not the last field - it's followed by int pr_fpvalid, so that
field ends up at different offsets for i386 and x32 layouts.

Fortunately, there's not much code that cares about any of that -
it's all encapsulated in fill_thread_core_info(). Since x32
variant is bigger, we define compat_elf_prstatus to match that
layout. That way i386 processes have enough space to fit
their layout into.

Moreover, since these layouts are identical prior to pr_reg,
we don't need to distinguish x32 and i386 cases when we are
setting the fields prior to pr_reg.

Filling pr_reg itself is done by calling ->get() method of
appropriate regset, and that method knows what layout (and size)
to use.

We do need to distinguish x32 and i386 cases only for two
things: setting ->pr_fpvalid (offset differs for x32 and
i386) and choosing the right size for our note.

The way it's done is Not Nice, for the lack of more accurate
printable description. There are two macros (PRSTATUS_SIZE and
SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
and (S)->pr_fpvalid = 1. On x86 asm/compat.h provides its own
variants.

Unfortunately, quite a few things go wrong there:
* PRSTATUS_SIZE doesn't use the normal test for process
being an x32 one; it compares the size reported by regset with
the size of pr_reg.
* it hardcodes the sizes of x32 and i386 variants (296 and 144
resp.), so if some change in includes leads to asm/compat.h pulled
in by fs/binfmt_elf.c we are in trouble - it will end up using
the size of x32 variant for 64bit processes.
* it's in the wrong place; asm/compat.h couldn't define
the structure for i386 layout, since it lacks quite a few types
needed for it. Hardcoded sizes are largely due to that.

The proper fix would be to have an explicitly defined i386 variant
of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
TIF_X32 to choose the variant that should be used. Unfortunately,
that requires some manipulations of headers; we'll do that later
in the series, but for now let's go with the minimal variant -
rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
and use the normal TIF_X32 check in that macro. The size of i386 variant
is kept hardcoded for now. Similar story for SET_PR_FPVALID.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c39ab6de 25-Nov-2020 Eric W. Biederman <ebiederm@xmission.com>

coredump: Document coredump code exclusively used by cell spufs

Oleg Nesterov recently asked[1] why is there an unshare_files in
do_coredump. After digging through all of the callers of lookup_fd it
turns out that it is
arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
that needs the unshare_files in do_coredump.

Looking at the history[2] this code was also the only piece of coredump code
that required the unshare_files when the unshare_files was added.

Looking at that code it turns out that cell is also the only
architecture that implements elf_coredump_extra_notes_size and
elf_coredump_extra_notes_write.

I looked at the gdb repo[3] support for cell has been removed[4] in binutils
2.34. Geoff Levand reports he is still getting questions on how to
run modern kernels on the PS3, from people using 3rd party firmware so
this code is not dead. According to Wikipedia the last PS3 shipped in
Japan sometime in 2017. So it will probably be a little while before
everyone's hardware dies.

Add some comments briefly documenting the coredump code that exists
only to support cell spufs to make it easier to understand the
coredump code. Eventually the hardware will be dead, or their won't
be userspace tools, or the coredump code will be refactored and it
will be too difficult to update a dead architecture and these comments
make it easy to tell where to pull to remove cell spufs support.

[1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
[2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared")
[3] git://sourceware.org/git/binutils-gdb.git
[4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 5e01fdff 31-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

fs: Replace zero-length array with flexible-array member

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 9a29a671 03-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

elf: Expose ELF header on arch_setup_additional_pages()

Like it is done for SET_PERSONALITY with ARM, which requires the ELF
header to select correct personality parameters, x86 requires the
headers when selecting which VDSO to load, instead of relying on the
going-away TIF_IA32/X32 flags.

Add an indirection macro to arch_setup_additional_pages(), that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures, who can continue to use the
original arch_setup_additional_pages for ELF and non-ELF binaries.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-8-krisman@collabora.com


# bc3d7bf6 03-Oct-2020 Gabriel Krisman Bertazi <krisman@collabora.com>

elf: Expose ELF header in compat_start_thread()

Like it is done for SET_PERSONALITY with x86, which requires the ELF header
to select correct personality parameters, x86 requires the headers on
compat_start_thread() to choose starting CS for ELF32 binaries, instead of
relying on the going-away TIF_IA32/X32 flags.

Add an indirection macro to ELF invocations of START_THREAD, that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures who don't need the header
information, they can continue to use the original start_thread for ELF and
non-ELF binaries, and it prevents affecting non-ELF code paths for x86.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-6-krisman@collabora.com


# b2767d97 17-Oct-2020 Jann Horn <jannh@google.com>

binfmt_elf: take the mmap lock around find_extend_vma()

create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it. (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)

While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.

Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.

(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a07279c9 15-Oct-2020 Jann Horn <jannh@google.com>

binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot

In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.

An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.

Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.

A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.

There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently. Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 429a22e7 15-Oct-2020 Jann Horn <jannh@google.com>

coredump: rework elf/elf_fdpic vma_dump_size() into common helper

At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.

Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.

As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.

This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem. (And arguably it looks nicer and makes
more sense in generic code.)

Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# afc63a97b 15-Oct-2020 Jann Horn <jannh@google.com>

coredump: refactor page range dumping into common helper

Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
pages into the coredump file. Extract that logic into a common helper.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce81bb25 15-Oct-2020 Chris Kennelly <ckennelly@google.com>

fs/binfmt_elf: use PT_LOAD p_align values for suitable start address

Patch series "Selecting Load Addresses According to p_align", v3.

The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

While specifying -z,max-page-size=0x200000 to the linker will generate
suitably aligned segments for huge pages on x86_64, the executable needs
to be loaded at a suitably aligned address as well. This alignment
requires the binary's cooperation, as distinct segments need to be
appropriately paddded to be eligible for THP.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

This patch (of 2):

The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.

[akpm@linux-foundation.org: fix max() warning]
[ckennelly@google.com: augment comment]
Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com

Signed-off-by: Chris Kennelly <ckennelly@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a896028 12-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

kill elf_fpxregs_t

all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b4e9c954 01-Jun-2020 Al Viro <viro@zeniv.linux.org.uk>

introduction of regset ->get() wrappers, switching ELF coredumps to those

Two new helpers: given a process and regset, dump into a buffer.
regset_get() takes a buffer and size, regset_get_alloc() takes size
and allocates a buffer.

Return value in both cases is the amount of data actually dumped in
case of success or -E... on error.

In both cases the size is capped by regset->n * regset->size, so
->get() is called with offset 0 and size no more than what regset
expects.

binfmt_elf.c callers of ->get() are switched to using those; the other
caller (copy_regset_to_user()) will need some preparations to switch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 852991dd 04-Jun-2020 Anthony Iliopoulos <ailiop@suse.com>

fs/binfmt_elf: remove redundant elf_map ifndef

The ifndef was added a long time ago to support archs that would define
their own mapping function. The last user was the metag arch which was
removed from the tree, and as such there are no users left. Let's kill
it.

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 646e84de 19-Feb-2020 Al Viro <viro@zeniv.linux.org.uk>

binfmt_elf: don't bother with __{put,copy_to}_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1d605416 27-May-2020 Alexander Potapenko <glider@google.com>

fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()

KMSAN reported uninitialized data being written to disk when dumping
core. As a result, several kilobytes of kmalloc memory may be written
to the core file and then read by a non-privileged user.

Reported-by: sam <sunhaoyl@outlook.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com
Link: https://github.com/google/kmsan/issues/76
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8a61c9e 14-May-2020 Eric W. Biederman <ebiederm@xmission.com>

exec: Generic execfd support

Most of the support for passing the file descriptor of an executable
to an interpreter already lives in the generic code and in binfmt_elf.
Rework the fields in binfmt_elf that deal with executable file
descriptor passing to make executable file descriptor passing a first
class concept.

Move the fd_install from binfmt_misc into begin_new_exec after the new
creds have been installed. This means that accessing the file through
/proc/<pid>/fd/N is able to see the creds for the new executable
before allowing access to the new executables files.

Performing the install of the executables file descriptor after
the point of no return also means that nothing special needs to
be done on error. The exiting of the process will close all
of it's open files.

Move the would_dump from binfmt_misc into begin_new_exec right
after would_dump is called on the bprm->file. This makes it
obvious this case exists and that no nesting of bprm->file is
currently supported.

In binfmt_misc the movement of fd_install into generic code means
that it's special error exit path is no longer needed.

Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 2388777a 03-May-2020 Eric W. Biederman <ebiederm@xmission.com>

exec: Rename flush_old_exec begin_new_exec

There is and has been for a very long time been a lot more going on in
flush_old_exec than just flushing the old state. After the movement
of code from setup_new_exec there is a whole lot more going on than
just flushing the old executables state.

Rename flush_old_exec to begin_new_exec to more accurately reflect
what this function does.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 96ecee29 03-May-2020 Eric W. Biederman <ebiederm@xmission.com>

exec: Merge install_exec_creds into setup_new_exec

The two functions are now always called one right after the
other so merge them together to make future maintenance easier.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# d2530b43 04-May-2020 Christoph Hellwig <hch@lst.de>

binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump

There is no logic in elf_core_dump itself or in the various arch helpers
called from it which use uaccess routines on kernel pointers except for
the file writes thate are nicely encapsulated by using __kernel_write in
dump_emit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fa4751f4 04-May-2020 Eric W. Biederman <ebiederm@xmission.com>

binfmt_elf: remove the set_fs in fill_siginfo_note

The code in binfmt_elf.c is differnt from the rest of the code that
processes siginfo, as it sends siginfo from a kernel buffer to a file
rather than from kernel memory to userspace buffers. To remove it's
use of set_fs the code needs some different siginfo helpers.

Add the helper copy_siginfo_to_external to copy from the kernel's
internal siginfo layout to a buffer in the siginfo layout that
userspace expects.

Modify fill_siginfo_note to use copy_siginfo_to_external instead of
set_fs and copy_siginfo_to_user.

Update compat_binfmt_elf.c to use the previously added
copy_siginfo_to_external32 to handle the compat case.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aa0d1564 06-Apr-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path

Static executables don't need to free NULL pointer.

It doesn't matter really because static executable is not common scenario
but do it anyway out of pedantry.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185330.GA4933@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0693ffeb 06-Apr-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: allocate less for static executable

PT_INTERP ELF header can be spared if executable is static.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185012.GB4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c69bcc93 06-Apr-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: delete "loc" variable

"loc" variable became just a wrapper for PT_INTERP ELF header after main
ELF header was moved to "bprm->buf". Delete it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219184847.GA4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03911132 06-Apr-2020 Anshuman Khandual <anshuman.khandual@arm.com>

mm/vma: replace all remaining open encodings with is_vm_hugetlb_page()

This replaces all remaining open encodings with is_vm_hugetlb_page().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fe0f6766 16-Mar-2020 Dave Martin <Dave.Martin@arm.com>

elf: Allow arch to tweak initial mmap prot flags

An arch may want to tweak the mmap prot flags for an
ELFexecutable's initial mappings. For example, arm64 is going to
need to add PROT_BTI for executable pages in an ELF process whose
executable is marked as using Branch Target Identification (an
ARMv8.5-A control flow integrity feature).

So that this can be done in a generic way, add a hook
arch_elf_adjust_prot() to modify the prot flags as desired: arches
can select CONFIG_HAVE_ELF_PROT and implement their own backend
where necessary.

By default, leave the prot flags unchanged.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


# 00e19cee 16-Mar-2020 Dave Martin <Dave.Martin@arm.com>

ELF: Add ELF program property parsing support

ELF program properties will be needed for detecting whether to
enable optional architecture or ABI features for a new ELF process.

For now, there are no generic properties that we care about, so do
nothing unless CONFIG_ARCH_USE_GNU_PROPERTY=y.

Otherwise, the presence of properties using the PT_PROGRAM_PROPERTY
phdrs entry (if any), and notify each property to the arch code.

For now, the added code is not used.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


# 1fbede6e 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: coredump: allow process with empty address space to coredump

Unmapping whole address space at once with

munmap(0, (1ULL<<47) - 4096)

or equivalent will create empty coredump.

It is silly way to exit, however registers content may still be useful.

The right to coredump is fundamental right of a process!

Link: http://lkml.kernel.org/r/20191222150137.GA1277@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28f46656 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: coredump: delete duplicated overflow check

array_size() macro will do overflow check anyway.

Link: http://lkml.kernel.org/r/20191222144009.GB24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 225a3f53 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: coredump: allocate core ELF header on stack

Comment says ELF header is "too large to be on stack". 64 bytes on
64-bit is not large by any means.

Link: http://lkml.kernel.org/r/20191222143850.GA24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 18676ffc 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: make BAD_ADDR() unlikely

If some mapping goes past TASK_SIZE it will be rejected by kernel which
means no such userspace binaries exist.

Mark every such check as unlikely.

Link: http://lkml.kernel.org/r/20191215124355.GA21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03c6d723 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: better codegen around current->mm

"current->mm" pointer is stable in general except few cases one of which
execve(2). Compiler can't treat is as stable but it _is_ stable most of
the time. During ELF loading process ->mm becomes stable right after
flush_old_exec().

Help compiler by caching current->mm, otherwise it continues to refetch
it.

add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-141 (-141)
Function old new delta
elf_core_dump 5062 5039 -23
load_elf_binary 5426 5308 -118

Note: other cases are left as is because it is either pessimisation or
no change in binary size.

Link: http://lkml.kernel.org/r/20191215124755.GB21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a62c5b1b 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: don't copy ELF header around

ELF header is read into bprm->buf[] by generic execve code.

Save a memcpy and allocate just one header for the interpreter instead
of two headers (64 bytes instead of 128 on 64-bit).

Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f67ef446 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: fix ->start_code calculation

Only executable segments should be accounted to ->start_code just like
they do to ->end_code (correctly).

Link: http://lkml.kernel.org/r/20191208171410.GB19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1f83d806 30-Jan-2020 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: smaller code generation around auxv vector fill

Filling auxv vector as array with index (auxv[i++] = ...) generates
terrible code. "saved_auxv" should be reworked because it is the worst
member of mm_struct by size/usefullness ratio but do it later.

Meanwhile help gcc a little with *auxv++ idiom.

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-127 (-127)
Function old new delta
load_elf_binary 5470 5343 -127

Link: http://lkml.kernel.org/r/20191208172301.GD19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 658c0335 04-Dec-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: extract elf_read() function

ELF reads done by the kernel have very complicated error detection code
which better live in one place.

Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 81696d5d 04-Dec-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: delete unused "interp_map_addr" argument

Link: http://lkml.kernel.org/r/20191005165049.GA26927@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2bb80d5 23-Nov-2017 Arnd Bergmann <arnd@arndb.de>

y2038: elfcore: Use __kernel_old_timeval for process times

We store elapsed time for a crashed process in struct elf_prstatus using
'timeval' structures. Once glibc starts using 64-bit time_t, this becomes
incompatible with the kernel's idea of timeval since the structure layout
no longer matches on 32-bit architectures.

This changes the definition of the elf_prstatus structure to use
__kernel_old_timeval instead, which is hardcoded to the currently used
binary layout. There is no risk of overflow in y2038 though, because
the time values are all relative times, and can store up to 68 years
of process elapsed time.

There is a risk of applications breaking at build time when they
use the new kernel headers and expect the type to be exactly 'timeval'
rather than a structure that has the same fields as before. Those
applications have to be modified to deal with 64-bit time_t anyway.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# b212921b 06-Oct-2019 Linus Torvalds <torvalds@linux-foundation.org>

elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings

In commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") we
changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
executable mappings.

Then, people reported that it broke some binaries that had overlapping
segments from the same file, and commit ad55eac74f20 ("elf: enforce
MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
overlaying elf segment cases. But only some - despite the summary line
of that commit, it only did it when it also does a temporary brk vma for
one obvious overlapping case.

Now Russell King reports another overlapping case with old 32-bit x86
binaries, which doesn't trigger that limited case. End result: we had
better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.

Yes, it's a sign of old binaries generated with old tool-chains, but we
do pride ourselves on not breaking existing setups.

This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
and the old load_elf_library() use-cases, because nobody has reported
breakage for those. Yet.

Note that in all the cases seen so far, the overlapping elf sections
seem to be just re-mapping of the same executable with different section
attributes. We could possibly introduce a new MAP_FIXED_NOFILECHANGE
flag or similar, which acts like NOREPLACE, but allows just remapping
the same executable file using different protection flags.

It's not clear that would make a huge difference to anything, but if
people really hate that "elf remaps over previous maps" behavior, maybe
at least a more limited form of remapping would alleviate some concerns.

Alternatively, we should take a look at our elf_map() logic to see if we
end up not mapping things properly the first time.

In the meantime, this is the minimal "don't do that then" patch while
people hopefully think about it more.

Reported-by: Russell King <linux@armlinux.org.uk>
Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Fixes: ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7be3cb01 26-Sep-2019 Kees Cook <keescook@chromium.org>

binfmt_elf: Do not move brk for INTERP-less ET_EXEC

When brk was moved for binaries without an interpreter, it should have
been limited to ET_DYN only. In other words, the special case was an
ET_DYN that lacks an INTERP, not just an executable that lacks INTERP.
The bug manifested for giant static executables, where the brk would end
up in the middle of the text area on 32-bit architectures.

Reported-and-tested-by: Richard Kojedzinszky <richard@kojedz.in>
Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 649775be 23-Sep-2019 Alexandre Ghiti <alex@ghiti.fr>

mm, fs: move randomize_stack_top from fs to mm

Patch series "Provide generic top-down mmap layout functions", v6.

This series introduces generic functions to make top-down mmap layout
easily accessible to architectures, in particular riscv which was the
initial goal of this series. The generic implementation was taken from
arm64 and used successively by arm, mips and finally riscv.

Note that in addition the series fixes 2 issues:

- stack randomization was taken into account even if not necessary.

- [1] fixed an issue with mmap base which did not take into account
randomization but did not report it to arm and mips, so by moving arm64
into a generic library, this problem is now fixed for both
architectures.

This work is an effort to factorize architecture functions to avoid code
duplication and oversights as in [1].

[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

This patch (of 14):

This preparatory commit moves this function so that further introduction
of generic topdown mmap layout is contained only in mm/util.c.

Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa94b1dc 16-Jul-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: delete stale comment

"passed_fileno" variable was deleted 11 years ago in 2.6.25.

Link: http://lkml.kernel.org/r/20190529201747.GA23248@avx2
Fixes: d20894a23708 ("Remove a.out interpreter support in ELF loader")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09c434b8 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for more missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# bbdc6076 14-May-2019 Kees Cook <keescook@chromium.org>

binfmt_elf: move brk out of mmap when doing direct loader exec

Commmit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
made changes in the rare case when the ELF loader was directly invoked
(e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
the loader), by moving into the mmap region to avoid both ET_EXEC and
PIE binaries. This had the effect of also moving the brk region into
mmap, which could lead to the stack and brk being arbitrarily close to
each other. An unlucky process wouldn't get its requested stack size
and stack allocations could end up scribbling on the heap.

This is illustrated here. In the case of using the loader directly, brk
(so helpfully identified as "[heap]") is allocated with the _loader_ not
the binary. For example, with ASLR entirely disabled, you can see this
more clearly:

$ /bin/cat /proc/self/maps
555555554000-55555555c000 r-xp 00000000 ... /bin/cat
55555575b000-55555575c000 r--p 00007000 ... /bin/cat
55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
55555575d000-55555577e000 rw-p 00000000 ... [heap]
...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

$ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
...
7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
nothing is there in the direct loader case (and ET_EXEC is still far
away at 0x400000). Anything that ran before should still work (i.e.
the ultimately-launched binary already had the brk very far from its
text, so this should be no different from a COMPAT_BRK standpoint). The
only risk I see here is that if someone started to suddenly depend on
the entire memory space lower than the mmap region being available when
launching binaries via a direct loader execs which seems highly
unlikely, I'd hope: this would mean a binary would _not_ work when
exec()ed normally.

(Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION
when randomization is turned on.)

Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast
Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com
Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ali Saidi <alisaidi@amazon.com>
Cc: Ali Saidi <alisaidi@amazon.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 249b08e4 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

elf: init pt_regs pointer later

Get "current_pt_regs" pointer right before usage.

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-180 (-180)
Function old new delta
load_elf_binary 5806 5626 -180 !!!

Looks like the compiler doesn't know that "current_pt_regs" is stable
pointer (because it doesn't know ->stack isn't) even though it knows
that "current" is stable pointer. So it saves it in the very beginning
and then tries to carry it through a lot of code.

Here is what happens here:

load_elf_binary()
...
mov rax,QWORD PTR gs:0x14c00
mov r13,QWORD PTR [rax+0x18] r13 = current->stack
call kmem_cache_alloc # first kmalloc

[980 bytes later!]

# let's spill that sucker because we need a register
# for "load_bias" calculations at
#
# if (interpreter) {
# load_bias = ELF_ET_DYN_BASE;
# if (current->flags & PF_RANDOMIZE)
# load_bias += arch_mmap_rnd();
# elf_flags |= elf_fixed;
# }
mov QWORD PTR [rsp+0x68],r13

If this is not _the_ root cause it is still eeeeh.

After the patch things become much simpler:

mov rax, QWORD PTR gs:0x14c00 # current
mov rdx, QWORD PTR [rax+0x18] # current->stack
movq [rdx+0x3fb8], 0 # fill pt_regs
...
call finalize_exec

Link: http://lkml.kernel.org/r/20190419200343.GA19788@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8e7cb39 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: extract PROT_* calculations

There are two places where mapping protections are calculated: one for
executable, another one for interpreter -- take them out.

ELF read and execute permissions are interchanged with Linux PROT_READ
and PROT_EXEC, microoptimizations are welcome!

Link: http://lkml.kernel.org/r/20190417213413.GB26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 85264316 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs//binfmt_elf.c: move variables initialization closer to their usage

Link: http://lkml.kernel.org/r/20190416202002.GB24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# be0deb58 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: save 1 indent level

Rewrite

for (...) {
if (->p_type == PT_INTERP) {
...
break;
}
}

loop into

for (...) {
if (->p_type != PT_INTERP)
continue;
...
break;
}

Link: http://lkml.kernel.org/r/20190416201906.GA24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ba0f6b88 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: delete trailing "return;" in functions returning "void"

Link: http://lkml.kernel.org/r/20190314205042.GE18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cc338010 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: free PT_INTERP filename ASAP

There is no reason for PT_INTERP filename to linger till the end of the
whole loading process.

Link: http://lkml.kernel.org/r/20190314204953.GD18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mukesh Ojha <mojha@codeaurora.org>
[nikitas.angelinas@gmail.com: fix GPF when dereferencing invalid interpreter]
Link: http://lkml.kernel.org/r/20190330140032.GA1527@vostro
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5cf4a363 14-May-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: make scope of "pos" variable smaller

Link: http://lkml.kernel.org/r/20190314204707.GC18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22f084db 14-May-2019 Andrew Morton <akpm@linux-foundation.org>

fs/binfmt_elf.c: remove unneeded initialization of mm->start_stack

As pointed out by zoujc@lenovo.com, setup_arg_pages() already
initialized current->mm->start_stack.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202881
Reported-by: <zoujc@lenovo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49ac9819 07-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: spread const a little

Link: http://lkml.kernel.org/r/20190204202830.GC27482@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93f044e2 07-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: use list_for_each_entry()

[adobriyan@gmail.com: fixup compilation]
Link: http://lkml.kernel.org/r/20190205064334.GA2152@avx2
Link: http://lkml.kernel.org/r/20190204202800.GB27482@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# faf1c315 07-Mar-2019 Alexey Dobriyan <adobriyan@gmail.com>

fs/binfmt_elf.c: don't be afraid of overflow

Number of ELF program headers is 16-bit by spec, so total size
comfortably fits into "unsigned int".

Space savings: 7 bytes!

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7)
Function old new delta
load_elf_phdrs 137 130 -7

Link: http://lkml.kernel.org/r/20190204202715.GA27482@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae7795bc 25-Sep-2018 Eric W. Biederman <ebiederm@xmission.com>

signal: Distinguish between kernel_siginfo and siginfo

Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 2f819db5 15-May-2018 Maciej W. Rozycki <macro@mips.com>

binfmt_elf: Respect error return from `regset->active'

The regset API documented in <linux/regset.h> defines -ENODEV as the
result of the `->active' handler to be used where the feature requested
is not available on the hardware found. However code handling core file
note generation in `fill_thread_core_info' interpretes any non-zero
result from the `->active' handler as the regset requested being active.
Consequently processing continues (and hopefully gracefully fails later
on) rather than being abandoned right away for the regset requested.

Fix the problem then by making the code proceed only if a positive
result is returned from the `->active' handler.

Signed-off-by: Maciej W. Rozycki <macro@mips.com>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 4206d3aa1978 ("elf core dump: notes user_regset")
Patchwork: https://patchwork.linux-mips.org/patch/19332/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org


# 24962af7 13-Jul-2018 Oscar Salvador <osalvador@suse.de>

fs, elf: make sure to page align bss in load_elf_library

The current code does not make sure to page align bss before calling
vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
the requested lenght not being correctly aligned.

Let us make sure to align it properly.

Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
for libc5.

Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86a2bb5a 14-Jun-2018 Alexey Dobriyan <adobriyan@gmail.com>

coredump: fix spam with zero VMA process

Nobody ever tried to self destruct by unmapping whole address space at
once:

munmap((void *)0, (1ULL << 47) - 4096);

Doing this produces 2 warnings for zero-length vmalloc allocations:

a.out[1353]: segfault at 7f80bcc4b757 ip 00007f80bcc4b757 sp 00007fff683939b8 error 14
a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
...
a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
...

Fix is to switch to kvmalloc().

Steps to reproduce:

// vsyscall=none
#include <sys/mman.h>
#include <sys/resource.h>
int main(void)
{
setrlimit(RLIMIT_CORE, &(struct rlimit){RLIM_INFINITY, RLIM_INFINITY});
munmap((void *)0, (1ULL << 47) - 4096);
return 0;
}

Link: http://lkml.kernel.org/r/20180410180353.GA2515@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 42bc47b3 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: Use array_size() in vmalloc()

The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

vmalloc(a * b)

with:
vmalloc(array_size(a, b))

as well as handling cases of:

vmalloc(a * b * c)

with:

vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
vmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
vmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
vmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
vmalloc(
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

vmalloc(
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
vmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
vmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
vmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
vmalloc(C1 * C2 * C3, ...)
|
vmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
vmalloc(C1 * C2, ...)
|
vmalloc(
- E1 * E2
+ array_size(E1, E2)
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 6da2ec56 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kmalloc() -> kmalloc_array()

The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

kmalloc(a * b, gfp)

with:
kmalloc_array(a * b, gfp)

as well as handling cases of:

kmalloc(a * b * c, gfp)

with:

kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# d23a61ee 20-Apr-2018 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error

Commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.

9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Complain only if -EEXIST, and use %px for printing the address.

Link: http://lkml.kernel.org/r/201804182307.FAC17665.SFMOFJVFtHOLOQ@I-love.SAKURA.ne.jp
Fixes: 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ad55eac7 10-Apr-2018 Michal Hocko <mhocko@suse.com>

elf: enforce MAP_FIXED on overlaying elf segments

Anshuman has reported that with "fs, elf: drop MAP_FIXED usage from
elf_map" applied, some ELF binaries in his environment fail to start
with

[ 23.423642] 9148 (sed): Uhuuh, elf segment at 0000000010030000 requested but the memory is mapped already
[ 23.423706] requested [10030000, 10040000] mapped [10030000, 10040000] 100073 anon

The reason is that the above binary has overlapping elf segments:

LOAD 0x0000000000000000 0x0000000010000000 0x0000000010000000
0x0000000000013a8c 0x0000000000013a8c R E 10000
LOAD 0x000000000001fd40 0x000000001002fd40 0x000000001002fd40
0x00000000000002c0 0x00000000000005e8 RW 10000
LOAD 0x0000000000020328 0x0000000010030328 0x0000000010030328
0x0000000000000384 0x00000000000094a0 RW 10000

That binary has two RW LOAD segments, the first crosses a page border
into the second

0x1002fd40 (LOAD2-vaddr) + 0x5e8 (LOAD2-memlen) == 0x10030328 (LOAD3-vaddr)

Handle this situation by enforcing MAP_FIXED when we establish a
temporary brk VMA to handle overlapping segments. All other mappings
will still use MAP_FIXED_NOREPLACE.

Link: http://lkml.kernel.org/r/20180213100440.GM3443@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4ed28639 10-Apr-2018 Michal Hocko <mhocko@suse.com>

fs, elf: drop MAP_FIXED usage from elf_map

Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable.

Let's take CVE-2017-1000253 as an example. At the time (before commit
eab09532d400: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e246: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED. Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b8383831 10-Apr-2018 Kees Cook <keescook@chromium.org>

exec: introduce finalize_exec() before start_thread()

Provide a final callback into fs/exec.c before start_thread() takes
over, to handle any last-minute changes, like the coming restoration of
the stack limit.

Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60c9d92f 06-Feb-2018 Alexey Dobriyan <adobriyan@gmail.com>

elf: fix NT_FILE integer overflow

If vm.max_map_count bumped above 2^26 (67+ mil) and system has enough RAM
to allocate all the VMAs (~12.8 GB on Fedora 27 with 200-byte VMAs), then
it should be possible to overflow 32-bit "size", pass paranoia check,
allocate very little vmalloc space and oops while writing into vmalloc
guard page...

But I didn't test this, only coredump of regular process.

Link: http://lkml.kernel.org/r/20180112203427.GA9109@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 27e64b4b 31-Oct-2017 Dave Martin <Dave.Martin@arm.com>

regset: Add support for dynamically sized regsets

Currently the regset API doesn't allow for the possibility that
regsets (or at least, the amount of meaningful data in a regset)
may change in size.

In particular, this results in useless padding being added to
coredumps if a regset's current size is smaller than its
theoretical maximum size.

This patch adds a get_size() function to struct user_regset.
Individual regset implementations can implement this function to
return the current size of the regset data. A regset_size()
function is added to provide callers with an abstract interface for
determining the size of a regset without needing to know whether
the regset is dynamically sized or not.

The only affected user of this interface is the ELF coredump code:
This patch ports ELF coredump to dump regsets with their actual
size in the coredump. This has no effect except for new regsets
that are dynamically sized and provide a get_size() implementation.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>


# 4755200b 16-Aug-2017 Nicolas Pitre <nico@fluxnic.net>

binfmt_elf: don't attempt to load FDPIC binaries

On platforms where both ELF and ELF-FDPIC variants are available, the
regular ELF loader will happily identify FDPIC binaries as proper ELF
and load them without the necessary FDPIC fixups, resulting in an
immediate user space crash. Let's prevent binflt_elf from loading those
binaries so binfmt_elf_fdpic has a chance to pick them up. For those
architectures that don't define elf_check_fdpic(), a default version
returning false is provided.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Mickael GUENE <mickael.guene@st.com>
Tested-by: Vincent Abriou <vincent.abriou@st.com>
Tested-by: Andras Szemzo <szemzo.andras@gmail.com>


# bdd1d2d3 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: fix kernel_read prototype

Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 01578e36 15-Aug-2017 Oleg Nesterov <oleg@redhat.com>

x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks

The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and
randomize_stack_top() are not required.

PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not
set, no need to re-check after that.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170815154011.GB1076@redhat.com


# c425e189 18-Jul-2017 Kees Cook <keescook@chromium.org>

binfmt: Introduce secureexec flag

The bprm_secureexec hook can be moved earlier. Right now, it is called
during create_elf_tables(), via load_binary(), via search_binary_handler(),
via exec_binprm(). Nearly all (see exception below) state used by
bprm_secureexec is created during the bprm_set_creds hook, called from
prepare_binprm().

For all LSMs (except commoncaps described next), only the first execution
of bprm_set_creds takes any effect (they all check bprm->called_set_creds
which prepare_binprm() sets after the first call to the bprm_set_creds
hook). However, all these LSMs also only do anything with bprm_secureexec
when they detected a secure state during their first run of bprm_set_creds.
Therefore, it is functionally identical to move the detection into
bprm_set_creds, since the results from secureexec here only need to be
based on the first call to the LSM's bprm_set_creds hook.

The single exception is that the commoncaps secureexec hook also examines
euid/uid and egid/gid differences which are controlled by bprm_fill_uid(),
via prepare_binprm(), which can be called multiple times (e.g.
binfmt_script, binfmt_misc), and may clear the euid/egid for the final
load (i.e. the script interpreter). However, while commoncaps specifically
ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time
prepare_binprm() may get called, it needs to base the secureexec decision
on the final call to bprm_set_creds. As a result, it will need special
handling.

To begin this refactoring, this adds the secureexec flag to the bprm
struct, and calls the secureexec hook during setup_new_exec(). This is
safe since all the cred work is finished (and past the point of no return).
This explicit call will be removed in later patches once the hook has been
removed.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>


# 67c6777a 10-Jul-2017 Kees Cook <keescook@chromium.org>

binfmt_elf: safely increment argv pointers

When building the argv/envp pointers, the envp is needlessly
pre-incremented instead of just continuing after the argv pointers are
finished. In some (likely impossible) race where the strings could be
changed from userspace between copy_strings() and here, it might be
possible to confuse the envp position. Instead, just use sp like
everything else.

Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eab09532 10-Jul-2017 Kees Cook <keescook@chromium.org>

binfmt_elf: use ELF_ET_DYN_BASE only for PIE

The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region. This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32ef5517 05-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h>

Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.

Update all code that relies on these facilities.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 68db0cf1 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>

We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5b825c3a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f7ccbae4 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h>

We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 16e72e9b 22-Feb-2017 Denys Vlasenko <dvlasenk@redhat.com>

powerpc: do not make the entire heap executable

On 32-bit powerpc the ELF PLT sections of binaries (built with
--bss-plt, or with a toolchain which defaults to it) look like this:

[17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
[18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
[19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4

Which results in an ELF load header:

Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000

This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.

Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.

gcc/ld has an option (--secure-plt) to not do this, this is said to
incur a small performance penalty.

Currently, to support 32-bit binaries with PLT in BSS kernel maps
*entire brk area* with executable rights for all binaries, even
--secure-plt ones.

Stop doing that.

Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.

Test program showing the difference in /proc/$PID/maps:

int main() {
char buf[16*1024];
char *p = malloc(123); /* make "[heap]" mapping appear */
int fd = open("/proc/self/maps", O_RDONLY);
int len = read(fd, buf, sizeof(buf));
write(1, buf, len);
printf("%p\n", p);
return 0;
}

Compiled using: gcc -mbss-plt -m32 -Os test.c -otest

Unpatched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10690000-106c0000 rwxp 00000000 00:00 0 [heap]
f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
0x10690008

Patched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10180000-101b0000 rw-p 00000000 00:00 0 [heap]
^^^^ this has changed
f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
0x10180008

The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:

https://lkml.org/lkml/2012/9/30/138

Lightly run-tested.

Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd19c364 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

fs/binfmt: Convert obsolete cputime type to nsecs

Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-12-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5613fda9 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Convert task/group cputime to nsecs

Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a1cecf2b 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Introduce special task_cputime_t() API to return old-typed cputime

This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4d22c75d 11-Jan-2017 Dave Kleikamp <dave.kleikamp@oracle.com>

coredump: Ensure proper size of sparse core files

If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.

After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.

This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 30f74aa0 12-Dec-2016 Jason Baron <jbaron@akamai.com>

binfmt_elf: use vmalloc() for allocation of vma_filesz

We have observed page allocations failures of order 4 during core dump
while trying to allocate vma_filesz. This results in a useless core
file of size 0. To improve reliability use vmalloc().

Note that the vmalloc() allocation is bounded by sysctl_max_map_count,
which is 65,530 by default. So with a 4k page size, and 8 bytes per
seg, this is a max of 128 pages or an order 7 allocation. Other parts
of the core dump path, such as fill_files_note() are already using
vmalloc() for presumably similar reasons.

Link: http://lkml.kernel.org/r/1479745791-17611-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90954e7b 05-Sep-2016 Dmitry Safonov <0x7f454c46@gmail.com>

x86/coredump: Use pr_reg size, rather that TIF_IA32 flag

Killed PR_REG_SIZE and PR_REG_PTR macro as we can get regset size
from regset view.
I wish I could also kill PRSTATUS_SIZE nicely.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-5-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 9f834ec1 22-Aug-2016 Linus Torvalds <torvalds@linux-foundation.org>

binfmt_elf: switch to new creds when switching to new mm

We used to delay switching to the new credentials until after we had
mapped the executable (and possible elf interpreter). That was kind of
odd to begin with, since the new executable will actually then _run_
with the new creds, but whatever.

The bigger problem was that we also want to make sure that we turn off
prof events and tracing before we start mapping the new executable
state. So while this is a cleanup, it's also a fix for a possible
information leak.

Reported-by: Robert Święcki <robert@swiecki.net>
Tested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0036d1f7 02-Aug-2016 Kees Cook <keescook@chromium.org>

binfmt_elf: fix calculations for bss padding

A double-bug exists in the bss calculation code, where an overflow can
happen in the "last_bss - elf_bss" calculation, but vm_brk internally
aligns the argument, underflowing it, wrapping back around safe. We
shouldn't depend on these bugs staying in sync, so this cleans up the
bss padding handling to avoid the overflow.

This moves the bss padzero() before the last_bss > elf_bss case, since
the zero-filling of the ELF_PAGE should have nothing to do with the
relationship of last_bss and elf_bss: any trailing portion should be
zeroed, and a zero size is already handled by padzero().

Then it handles the math on elf_bss vs last_bss correctly. These need
to both be ELF_PAGE aligned to get the comparison correct, since that's
the expected granularity of the mappings. Since elf_bss already had
alignment-based padding happen in padzero(), the "start" of the new
vm_brk() should be moved forward as done in the original code. However,
since the "end" of the vm_brk() area will already become PAGE_ALIGNed in
vm_brk() then last_bss should get aligned here to avoid hiding it as a
side-effect.

Additionally makes a cosmetic change to the initial last_bss calculation
so it's easier to read in comparison to the load_addr calculation above
it (i.e. the only difference is p_filesz vs p_memsz).

Link: http://lkml.kernel.org/r/1468014494-25291-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Ismael Ripoll Ripoll <iripoll@upv.es>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1607f09c 05-Jun-2016 Mateusz Guzik <mguzik@redhat.com>

coredump: fix dumping through pipes

The offset in the core file used to be tracked with ->written field of
the coredump_params structure. The field was retired in favour of
file->f_pos.

However, ->f_pos is not maintained for pipes which leads to breakage.

Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.

Fixes: a00839395103 ("get rid of coredump_params->written").

Reported-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5d22fc25 27-May-2016 Linus Torvalds <torvalds@linux-foundation.org>

mm: remove more IS_ERR_VALUE abuses

The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.

However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.

What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.

So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ecc2bc8a 23-May-2016 Michal Hocko <mhocko@suse.com>

mm, elf: handle vm_brk error

load_elf_library doesn't handle vm_brk failure although nothing really
indicates it cannot do that because the function is allowed to fail due
to vm_mmap failures already. This might be not a problem now but later
patch will make vm_brk killable (resp. mmap_sem for write waiting will
become killable) and so the failure will be more probable.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0083939 11-May-2016 Omar Sandoval <osandov@fb.com>

coredump: get rid of coredump_params->written

cprm->written is redundant with cprm->file->f_pos, so use that instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ef11c35 26-Feb-2016 Daniel Cashman <dcashman@android.com>

mm: ASLR: use get_random_long()

Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# eb4bc076 12-Nov-2015 Maciej W. Rozycki <macro@mips.com>

ELF: Also pass any interpreter's file header to `arch_check_elf'

Also pass any interpreter's file header to `arch_check_elf' so that any
architecture handler can have a look at it if needed.

Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Fortune <Matthew.Fortune@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/11478/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>


# 54d15714 26-Oct-2015 Maciej W. Rozycki <macro@mips.com>

binfmt_elf: Correct `arch_check_elf's description

Correct `arch_check_elf's description, mistakenly copied and pasted from
`arch_elf_pt_proc'.

Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b582ef5c 26-Oct-2015 Maciej W. Rozycki <macro@mips.com>

binfmt_elf: Don't clobber passed executable's file header

Do not clobber the buffer space passed from `search_binary_handler' and
originally preloaded by `prepare_binprm' with the executable's file
header by overwriting it with its interpreter's file header. Instead
keep the buffer space intact and directly use the data structure locally
allocated for the interpreter's file header, fixing a bug introduced in
2.1.14 with loadable module support (linux-mips.org commit beb11695
[Import of Linux/MIPS 2.1.14], predating kernel.org repo's history).
Adjust the amount of data read from the interpreter's file accordingly.

This was not an issue before loadable module support, because back then
`load_elf_binary' was executed only once for a given ELF executable,
whether the function succeeded or failed.

With loadable module support supported and enabled, upon a failure of
`load_elf_binary' -- which may for example be caused by architecture
code rejecting an executable due to a missing hardware feature requested
in the file header -- a module load is attempted and then the function
reexecuted by `search_binary_handler'. With the executable's file
header replaced with its interpreter's file header the executable can
then be erroneously accepted in this subsequent attempt.

Cc: stable@vger.kernel.org # all the way back
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5037835c 05-Oct-2015 Ross Zwisler <zwisler@kernel.org>

coredump: add DAX filtering for ELF coredumps

Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings. This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.

Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags. Also update the documented default value of
coredump_filter to be consistent with the core(5) man page. The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config. This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 9bf39ab2 19-Jun-2015 Miklos Szeredi <mszeredi@suse.cz>

vfs: add file_path() helper

Turn
d_path(&file->f_path, ...);
into
file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2b1d3ae9 28-May-2015 Andrew Morton <akpm@linux-foundation.org>

fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings

load_elf_binary() returns `retval', not `error'.

Fixes: a87938b2e246b81b4fb ("fs/binfmt_elf.c: fix bug in loading of PIE binaries")
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: Michael Davidson <md@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 204db6ed 14-Apr-2015 Kees Cook <keescook@chromium.org>

mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE

The arch_randomize_brk() function is used on several architectures,
even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
the architectures that support it, while still handling CONFIG_COMPAT_BRK.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Mller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1fd836d 14-Apr-2015 Kees Cook <keescook@chromium.org>

mm: split ET_DYN ASLR from mmap ASLR

This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
powerpc, and x86. The problem is that if there is a leak of ASLR from
the executable (ET_DYN), it means a leak of shared library offset as
well (mmap), and vice versa. Further details and a PoC of this attack
is available here:

http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html

With this patch, a PIE linked executable (ET_DYN) has its own ASLR
region:

$ ./show_mmaps_pie
54859ccd6000-54859ccd7000 r-xp ... /tmp/show_mmaps_pie
54859ced6000-54859ced7000 r--p ... /tmp/show_mmaps_pie
54859ced7000-54859ced8000 rw-p ... /tmp/show_mmaps_pie
7f75be764000-7f75be91f000 r-xp ... /lib/x86_64-linux-gnu/libc.so.6
7f75be91f000-7f75beb1f000 ---p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb1f000-7f75beb23000 r--p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb23000-7f75beb25000 rw-p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb25000-7f75beb2a000 rw-p ...
7f75beb2a000-7f75beb4d000 r-xp ... /lib64/ld-linux-x86-64.so.2
7f75bed45000-7f75bed46000 rw-p ...
7f75bed46000-7f75bed47000 r-xp ...
7f75bed47000-7f75bed4c000 rw-p ...
7f75bed4c000-7f75bed4d000 r--p ... /lib64/ld-linux-x86-64.so.2
7f75bed4d000-7f75bed4e000 rw-p ... /lib64/ld-linux-x86-64.so.2
7f75bed4e000-7f75bed4f000 rw-p ...
7fffb3741000-7fffb3762000 rw-p ... [stack]
7fffb377b000-7fffb377d000 r--p ... [vvar]
7fffb377d000-7fffb377f000 r-xp ... [vdso]

The change is to add a call the newly created arch_mmap_rnd() into the
ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR,
as was already done on s390. Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE,
which is no longer needed.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Mller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a87938b2 14-Apr-2015 Michael Davidson <md@google.com>

fs/binfmt_elf.c: fix bug in loading of PIE binaries

With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
address allocation strategy, load_elf_binary() will attempt to map a PIE
binary into an address range immediately below mm->mmap_base.

Unfortunately, load_elf_ binary() does not take account of the need to
allocate sufficient space for the entire binary which means that, while
the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are
that is supposed to be the "gap" between the stack and the binary.

Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this
means that binaries with large data segments > 128MB can end up mapping
part of their data segment over their stack resulting in corruption of the
stack (and the data segment once the binary starts to run).

Any PIE binary with a data segment > 128MB is vulnerable to this although
address randomization means that the actual gap between the stack and the
end of the binary is normally greater than 128MB. The larger the data
segment of the binary the higher the probability of failure.

Fix this by calculating the total size of the binary in the same way as
load_elf_interp().

Signed-off-by: Michael Davidson <md@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e7c22d4 14-Feb-2015 Hector Marco-Gisbert <hecmargi@upv.es>

x86, mm/ASLR: Fix stack randomization on 64-bit systems

The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.

The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":

static unsigned long randomize_stack_top(unsigned long stack_top)
{
unsigned int random_variable = 0;

if ((current->flags & PF_RANDOMIZE) &&
!(current->personality & ADDR_NO_RANDOMIZE)) {
random_variable = get_random_int() & STACK_RND_MASK;
random_variable <<= PAGE_SHIFT;
}
return PAGE_ALIGN(stack_top) + random_variable;
return PAGE_ALIGN(stack_top) - random_variable;
}

Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):

random_variable <<= PAGE_SHIFT;

then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.

These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).

This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().

The successful fix can be tested with:

$ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack]
7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack]
7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack]
7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack]
...

Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.

Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>


# 52f5592e 10-Dec-2014 Jungseung Lee <js07.lee@gmail.com>

fs/binfmt_elf.c: fix internal inconsistency relating to vma dump size

vma_dump_size() has been used several times on actual dumper and it is
supposed to return the same value for the same vma. But vma_dump_size()
could return different values for same vma.

The known problem case is concurrent shared memory removal. If a vma is
used for a shared memory and that shared memory is removed between
writing program header and dumping vma memory, this will result in a
dump file which is internally consistent.

To fix the problem, we set baseline to get dump size and store the size
into vma_filesz and always use the same vma dump size which is stored in
vma_filsz. The consistnecy with reality is not actually guranteed, but
it's tolerable since that is fully consistent with base line.

Signed-off-by: Jungseung Lee <js07.lee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 774c105e 11-Sep-2014 Paul Burton <paulburton@kernel.org>

binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers

MIPS is introducing new variants of its O32 ABI which differ in their
handling of floating point, in order to enable a gradual transition
towards a world where mips32 binaries can take advantage of new hardware
features only available when configured for certain FP modes. In order
to do this ELF binaries are being augmented with a new section that
indicates, amongst other things, the FP mode requirements of the binary.
The presence & location of such a section is indicated by a program
header in the PT_LOPROC ... PT_HIPROC range.

In order to allow the MIPS architecture code to examine the program
header & section in question, pass all program headers in this range
to an architecture-specific arch_elf_pt_proc function. This function
may return an error if the header is deemed invalid or unsuitable for
the system, in which case that error will be returned from
load_elf_binary and upwards through the execve syscall.

A means is required for the architecture code to make a decision once
it is known that all such headers have been seen, but before it is too
late to return from an execve syscall. For this purpose the
arch_check_elf function is added, and called once, after all PT_LOPROC
to PT_HIPROC headers have been passed to arch_elf_pt_proc but before
the code which invoked execve has been lost. This enables the
architecture code to make a decision based upon all the headers present
in an ELF binary and its interpreter, as is required to forbid
conflicting FP ABI requirements between an ELF & its interpreter.

In order to allow data to be stored throughout the calls to the above
functions, struct arch_elf_state is introduced.

Finally a variant of the SET_PERSONALITY macro is introduced which
accepts a pointer to the struct arch_elf_state, allowing it to act
based upon state observed from the architecture specific program
headers.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7679/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>


# a9d9ef13 11-Sep-2014 Paul Burton <paulburton@kernel.org>

binfmt_elf: load interpreter program headers earlier

Load the program headers of an ELF interpreter early enough in
load_elf_binary that they can be examined before it's too late to return
an error from an exec syscall. This patch does not perform any such
checking, it merely lays the groundwork for a further patch to do so.

No functional change is intended.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7675/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>


# 6a8d3894 11-Sep-2014 Paul Burton <paulburton@kernel.org>

binfmt_elf: Hoist ELF program header loading to a function

load_elf_binary & load_elf_interp both load program headers from an ELF
executable in the same way, duplicating the code. This patch introduces
a helper function (load_elf_phdrs) which performs this common task &
calls it from both load_elf_binary & load_elf_interp. In addition to
reducing code duplication, this is part of preparing to load the ELF
interpreter headers earlier such that they can be examined before it's
too late to return an error from an exec syscall.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7676/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>


# 19d860a1 04-May-2014 Al Viro <viro@zeniv.linux.org.uk>

handle suicide on late failure exits in execve() in search_binary_handler()

... rather than doing that in the guts of ->load_binary().
[updated to fix the bug spotted by Shentino - for SIGSEGV we really need
something stronger than send_sig_info(); again, better do that in one place]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b219e25f 04-Jun-2014 Fabian Frederick <fabf@skynet.be>

fs/binfmt_elf.c: fix bool assignements

Fix coccinelle warnings.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78d683e8 19-May-2014 Andy Lutomirski <luto@amacapital.net>

mm, fs: Add vm_ops->name as an alternative to arch_vma_name

arch_vma_name sucks. It's a silly hack, and it's annoying to
implement correctly. In fact, AFAICS, even the straightforward x86
implementation is incorrect (I suspect that it breaks if the vdso
mapping is split or gets remapped).

This adds a new vm_ops->name operation that can replace it. The
followup patches will remove all uses of arch_vma_name on x86,
fixing a couple of annoyances in the process.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# ab0e113f 07-Apr-2014 Alex Thorlton <athorlton@sgi.com>

exec: kill the unnecessary mm->def_flags setting in load_elf_binary()

load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
is always zero. Not only this looks strange, this is unnecessary
because mm_init() has already set ->def_flags = 0.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69369a70 03-Apr-2014 Josh Triplett <josh@joshtriplett.org>

fs, kernel: permit disabling the uselib syscall

uselib hasn't been used since libc5; glibc does not use it. Support
turning it off.

When disabled, also omit the load_elf_library implementation from
binfmt_elf.c, which only uselib invokes.

bloat-o-meter:
add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function old new delta
padzero 39 36 -3
uselib_flags 20 - -20
sys_uselib 168 - -168
SyS_uselib 168 - -168
load_elf_library 426 - -426

The new CONFIG_USELIB defaults to `y'.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7a5f4f1c 23-Jan-2014 Todor Minchev <todor@minchev.co.uk>

fs: binfmt_elf: remove unused defines INTERPRETER_NONE and INTERPRETER_ELF

These two defines are unused since the removal of the a.out interpreter
support in the ELF loader in kernel 2.6.25

Signed-off-by: Todor Minchev <todor@minchev.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# afabada9 14-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo)

we can't get to do_coredump() if that condition isn't satisfied...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ec57941e 13-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

constify do_coredump() argument

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ce395960 13-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

constify copy_siginfo_to_user{,32}()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 22a8cb82 08-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: dump_align()

dump_skip to given alignment...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9b56d543 08-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

dump_skip(): dump_seek() replacement taking coredump_params

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1ad67015 07-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

binfmt_elf: count notes towards coredump limit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cdc3d562 05-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

switch elf_coredump_extra_notes_write() to dump_emit()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 13046ece 05-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

binfmt_elf: convert writing actual dump pages to dump_emit()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aa3e7eaf 05-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

switch elf_core_write_extra_data() to dump_emit()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 506f21c5 05-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

switch elf_core_write_extra_phdrs() to dump_emit()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ecc8c772 05-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: dump_emit()

dump_write() analog, takes core_dump_params instead of file,
keeps track of the amount written in cprm->written and checks for
cprm->limit. Start using it in binfmt_elf.c...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72c2d531 22-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

file->f_op is never NULL...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72023656 30-Sep-2013 Dan Aloni <alonid@stratoscale.com>

fs/binfmt_elf.c: prevent a coredump with a large vm_map_count from Oopsing

A high setting of max_map_count, and a process core-dumping with a large
enough vm_map_count could result in an NT_FILE note not being written,
and the kernel crashing immediately later because it has assumed
otherwise.

Reproduction of the oops-causing bug described here:

https://lkml.org/lkml/2013/8/30/50

Rge ussue originated in commit 2aa362c49c31 ("coredump: extend core dump
note section to contain file names of mapped file") from Oct 4, 2012.

This patch make that section optional in that case. fill_files_note()
should signify the error, and also let the info struct in
elf_core_dump() be zero-initialized so that we can check for the
optionally written note.

[akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Dan Aloni <alonid@stratoscale.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Reported-by: Martin MOKREJS <mmokrejs@gmail.com>
Tested-by: Martin MOKREJS <mmokrejs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98d1e64f 10-Jul-2013 Michel Lespinasse <walken@google.com>

mm: remove free_area_cache

Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 079148b9 30-Apr-2013 Oleg Nesterov <oleg@redhat.com>

coredump: factor out the setting of PF_DUMPCORE

Cleanup. Every linux_binfmt->core_dump() sets PF_DUMPCORE, move this into
zap_threads() called by do_coredump().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Neil Horman <nhorman@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1d025e2 30-Apr-2013 Jiri Kosina <jkosina@suse.cz>

binfmt_elf: PIE: make PF_RANDOMIZE check comment more accurate

The comment I originally added in commit a3defbe5c337 ("binfmt_elf: fix
PIE execution with randomization disabled") is not really 100% accurate
-- sysctl is not the only way how PF_RANDOMIZE could be forcibly unset
in runtime.

Another option of course is direct modification of personality flags
(i.e. running through setarch wrapper).

Make the comment more explicit and accurate.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2171364d 17-Apr-2013 Michael Neuling <mikey@neuling.org>

powerpc: Add HWCAP2 aux entry

We are currently out of free bits in AT_HWCAP. With POWER8, we have
several hardware features that we need to advertise.

Tested on POWER and x86.

Signed-off-by: Michael Neuling <michael@neuling.org>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 23d9e482 17-Apr-2013 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

fs/binfmt_elf.c: fix hugetlb memory check in vma_dump_size()

Documentation/filesystems/proc.txt says about coredump_filter bitmask,

Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
effected by bit 5-6.

However current code can go into the subsequent flag checks of bit 0-4
for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
as written in the document.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c07380be 09-May-2011 James Hogan <jhogan@kernel.org>

Revert some of "binfmt_elf: cleanups"

The commit "binfmt_elf: cleanups"
(f670d0ecda73b7438eec9ed108680bc5f5362ad8) removed an ifndef elf_map but
this breaks compilation for metag which does define elf_map.

This adds the ifndef back in as it was before, but does not affect the
other cleanups made by that patch.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Mikael Pettersson <mikpe@it.uu.se>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d3330cf0 21-Feb-2013 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

binfmt_elf: remove unused argument in fill_elf_header

In fill_elf_header(), elf->e_ident[EI_OSABI] is always set to ELF_OSABI,
so remove the unused argument 'osabi'.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6fac4829 13-Nov-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Use accessors to read task cputime stats

This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>


# 6899e92d 17-Dec-2012 Alan Cox <alan@linux.intel.com>

binfmt_elf: fix corner case kfree of uninitialized data

If elf_core_dump() is called and fill_note_info() fails in the kmalloc()
then it returns 0 but has not yet initialised all the needed fields. As a
result we do a kfree(randomness) after correctly skipping the thread data.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71613c3b 20-Oct-2012 Al Viro <viro@zeniv.linux.org.uk>

get rid of pt_regs argument of ->load_binary()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 314e51b9 08-Oct-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: kill vma flag VM_RESERVED and mm->reserved_vm counter

A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct. Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0103bd16 08-Oct-2012 Konstantin Khlebnikov <khlebnikov@openvz.org>

mm: prepare VM_DONTDUMP for using in drivers

Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags:
VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for
sys_madvise. The next patch will use it for replacing the outdated flag
VM_RESERVED.

Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
(VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2aa362c4 04-Oct-2012 Denys Vlasenko <vda.linux@googlemail.com>

coredump: extend core dump note section to contain file names of mapped files

This note has the following format:

long count -- how many files are mapped
long page_size -- units for file_ofs
array of [COUNT] elements of
long start
long end
long file_ofs
followed by COUNT filenames in ASCII: "FILE1" NUL "FILE2" NUL...

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49ae4d4b 04-Oct-2012 Denys Vlasenko <vda.linux@googlemail.com>

coredump: add a new elf note with siginfo of the signal

Existing PRSTATUS note contains only si_signo, si_code, si_errno fields
from the siginfo of the signal which caused core to be dumped.

There are tools which try to analyze crashes for possible security
implications, and they want to use, among other data, si_addr field from
the SIGSEGV.

This patch adds a new elf note, NT_SIGINFO, which contains the complete
siginfo_t of the signal which killed the process.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ab1c309 04-Oct-2012 Denys Vlasenko <vda.linux@googlemail.com>

coredump: pass siginfo_t* to do_coredump() and below, not merely signr

This is a preparatory patch for the introduction of NT_SIGINFO elf note.

With this patch we pass "siginfo_t *siginfo" instead of "int signr" to
do_coredump() and put it into coredump_params. It will be used by the
next patch. Most changes are simple s/signr/siginfo->si_signo/.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6eec482f 04-Oct-2012 Alan Cox <alan@linux.intel.com>

binfmt_elf: Uninitialized variable

load_elf_interp() has interp_map_addr carefully described as
"uninitialized_var" and marked so as to avoid a warning. However if you
trace the code it is passed into load_elf_interp and then this value is
checked against NULL.

As this return value isn't used this is actually safe but it freaks
various analysis tools that see un-initialized memory addresses being read
before their value is ever defined.

Set it to NULL as a matter of programming good taste if nothing else

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f34f9d18 25-Sep-2012 Denys Vlasenko <vda.linux@googlemail.com>

coredump: prevent double-free on an error path in core dumper

In !CORE_DUMP_USE_REGSET case, if elf_note_info_init fails to allocate
memory for info->fields, it frees already allocated stuff and returns
error to its caller, fill_note_info. Which in turn returns error to its
caller, elf_core_dump. Which jumps to cleanup label and calls
free_note_info, which will happily try to free all info->fields again.
BOOM.

This is the fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Venu Byravarasu <vbyravarasu@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 826eba4d 02-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

the only place that needs to include asm/exec.h is linux/binfmts.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5a5e4c2e 29-May-2012 Al Viro <viro@zeniv.linux.org.uk>

binfmt_elf: switch elf_map() to vm_mmap/vm_munmap

No reason to hold ->mmap_sem over the sequence

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ebc887b2 07-Feb-2012 Eric W. Biederman <ebiederm@xmission.com>

userns: Convert binary formats to use kuid/kgid where appropriate

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 6be5ceb0 20-Apr-2012 Linus Torvalds <torvalds@linux-foundation.org>

VM: add "vm_mmap()" helper function

This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.

This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
to export our internal do_mmap_pgoff() function.

Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead. We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4eb1ff6 20-Apr-2012 Linus Torvalds <torvalds@linux-foundation.org>

VM: add "vm_brk()" helper function

It does the same thing as "do_brk()", except it handles the VM locking
too.

It turns out that all external callers want that anyway, so we can make
do_brk() static to just mm/mmap.c while at it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 96f951ed 28-Mar-2012 David Howells <dhowells@redhat.com>

Add #includes needed to permit the removal of asm/system.h

asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.

asm/system.h has been disintegrated by this point on all arches into the
following common segments:

(1) asm/barrier.h

Moved memory barrier definitions here.

(2) asm/cmpxchg.h

Moved xchg() and cmpxchg() here. #included in asm/atomic.h.

(3) asm/bug.h

Moved die() and similar here.

(4) asm/exec.h

Moved arch_align_stack() here.

(5) asm/elf.h

Moved AT_VECTOR_SIZE_ARCH here.

(6) asm/switch_to.h

Moved switch_to() here.

Signed-off-by: David Howells <dhowells@redhat.com>


# accb61fe 23-Mar-2012 Jason Baron <jbaron@redhat.com>

coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP

Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.

The specific application I have in mind is qemu: we can add a flag there
that wouldn't dump all of guest memory when qemu dumps core. This flag
might also be useful for security sensitive apps that want to absolutely
make sure that parts of memory are not dumped. To clear the flag use:
MADV_DODUMP.

[akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
[akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 909af768 23-Mar-2012 Jason Baron <jbaron@redhat.com>

coredump: remove VM_ALWAYSDUMP flag

The motivation for this patchset was that I was looking at a way for a
qemu-kvm process, to exclude the guest memory from its core dump, which
can be quite large. There are already a number of filter flags in
/proc/<pid>/coredump_filter, however, these allow one to specify 'types'
of kernel memory, not specific address ranges (which is needed in this
case).

Since there are no more vma flags available, the first patch eliminates
the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
the kernel to mark vdso and vsyscall pages. However, it is simple
enough to check if a vma covers a vdso or vsyscall page without the need
for this flag.

The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
continue to work the same as before unless 'MADV_DONTDUMP' is set on the
region.

The qemu code which implements this features is at:

http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

In my testing the qemu core dump shrunk from 383MB -> 13MB with this
patch.

I also believe that the 'MADV_DONTDUMP' flag might be useful for
security sensitive apps, which might want to select which areas are
dumped.

This patch:

The VM_ALWAYSDUMP flag is currently used by the coredump code to
indicate that a vma is part of a vsyscall or vdso section. However, we
can determine if a vma is in one these sections by checking it against
the gate_vma and checking for a non-NULL return value from
arch_vma_name(). Thus, freeing a valuable vma bit.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 19e5109f 23-Feb-2012 Al Viro <viro@zeniv.linux.org.uk>

take removal of PF_FORKNOEXEC to flush_old_exec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8fc3dc5a 17-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

__register_binfmt() made void

Just don't pass NULL to it - nobody does, anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c8e25258 02-Mar-2012 H. Peter Anvin <hpa@zytor.com>

regset: Prevent null pointer reference on readonly regsets

The regset common infrastructure assumed that regsets would always
have .get and .set methods, but not necessarily .active methods.
Unfortunately people have since written regsets without .set methods.

Rather than putting in stub functions everywhere, handle regsets with
null .get or .set methods explicitly.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0953f65d 14-Feb-2012 H. J. Lu <hjl.tools@gmail.com>

elf: Allow core dump-related fields to be overridden

Allow some core dump-related fields to be overridden. This allows
core dumps to work correctly for x32.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>


# e39f5602 10-Jan-2012 David Daney <ddaney.cavm@gmail.com>

fs: binfmt_elf: create Kconfig variable for PIE randomization

Randomization of PIE load address is hard coded in binfmt_elf.c for X86
and ARM. Create a new Kconfig variable
(CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead. Thus
architecture specific policy is pushed out of the generic binfmt_elf.c and
into the architecture Kconfig files.

X86 and ARM Kconfigs are modified to select the new variable so there is
no change in behavior. A follow on patch will select it for MIPS too.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a3defbe5 02-Nov-2011 Jiri Kosina <jkosina@suse.cz>

binfmt_elf: fix PIE execution with randomization disabled

The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().

Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.

Based on original patch by H.J. Lu and Josh Boyer.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b5d783c 18-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling

new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4471a675 14-Apr-2011 Jiri Kosina <jkosina@suse.cz>

brk: COMPAT_BRK: fix detection of randomized brk

5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
tried to get the whole logic of brk randomization for legacy
(libc5-based) applications finally right.

It turns out that the way to detect whether brk has actually been
randomized in the end or not introduced by that patch still doesn't work
for those binaries, as reported by Geert:

: /sbin/init from my old m68k ramdisk exists prematurely.
:
: Before the patch:
:
: | brk(0x80005c8e) = 0x80006000
:
: After the patch:
:
: | brk(0x80005c8e) = 0x80005c8e
:
: Old libc5 considers brk() to have failed if the return value is not
: identical to the requested value.

I don't like it, but currently see no better option than a bit flag in
task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
case.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 31db58b3 13-Mar-2011 Stephen Wilson <wilsons@start.ca>

mm: arch: make get_gate_vma take an mm_struct instead of a task_struct

Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task. Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1a530a6f 22-Mar-2011 David Daney <ddaney@caviumnetworks.com>

binfmt_elf: quiet GCC-4.6 'set but not used' warning in load_elf_binary()

With GCC-4.6 we get warnings about things being 'set but not used'.

In load_elf_binary() this can happen with reloc_func_desc if ELF_PLAT_INIT
is defined, but doesn't use the reloc_func_desc argument.

Quiet the warning/error by marking reloc_func_desc as __maybe_unused.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f670d0ec 12-Jan-2011 Mikael Pettersson <mikpe@it.uu.se>

binfmt_elf: cleanups

This cleans up a few bits in binfmt_elf.c and binfmts.h:

- the hasvdso field in struct linux_binfmt is unused, so remove it and
the only initialization of it

- the elf_map CPP symbol is not defined anywhere in the kernel, so
remove an unnecessary #ifndef elf_map

- reduce excessive indentation in elf_format's initializer

- add missing spaces, remove extraneous spaces

No functional changes, but tested on x86 (32 and 64 bit), powerpc (32 and
64 bit), sparc64, arm, and alpha.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e4eab08d 20-Aug-2010 Nicolas Pitre <nico@fluxnic.net>

ARM: 6342/1: fix ASLR of PIE executables

Since commits 990cb8acf2 and cc92c28b2d, it is possible to have full
address space layout randomization (ASLR) on ARM. Except that one small
change was missing for ASLR of PIE executables.

Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>


# 30736a4d 05-Mar-2010 Masami Hiramatsu <mhiramat@redhat.com>

coredump: pass mm->flags as a coredump parameter for consistency

Pass mm->flags as a coredump parameter for consistency.

---
1787 if (mm->core_state || !get_dumpable(mm)) { <- (1)
1788 up_write(&mm->mmap_sem);
1789 put_cred(cred);
1790 goto fail;
1791 }
1792
[...]
1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode */ <-(2)
1799 flag = O_EXCL; /* Stop rewrite attacks */
1800 cred->fsuid = 0; /* Dump root private */
1801 }
---

Since dumpable bits are not protected by lock, there is a chance to change
these bits between (1) and (2).

To solve this issue, this patch copies mm->flags to
coredump_params.mm_flags at the beginning of do_coredump() and uses it
instead of get_dumpable() while dumping core.

This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
dump_filter bits in mm->flags.

[akpm@linux-foundation.org: fix merge]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d9032bb 05-Mar-2010 Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>

elf coredump: add extended numbering support

The current ELF dumper implementation can produce broken corefiles if
program headers exceed 65535. This number is determined by the number of
vmas which the process have. In particular, some extreme programs may use
more than 65535 vmas. (If you google max_map_count, you can find some
users facing this problem.) This kind of program never be able to generate
correct coredumps.

This patch implements ``extended numbering'' that uses sh_info field of
the first section header instead of e_phnum field in order to represent
upto 4294967295 vmas.

This is supported by
AMD64-ABI(http://www.x86-64.org/documentation.html) and
Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
Of course, we are preparing patches for gdb and binutils.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 93eb211e 05-Mar-2010 Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>

elf coredump: make offset calculation process and writing process explicit

By the next patch, elf_core_dump() and elf_fdpic_core_dump() will support
extended numbering and so will produce the corefiles with section header
table in a special case.

The problem is the process of writing a file header offset of the section
header table into e_shoff field of the ELF header. ELF header is
positioned at the beginning of the corefile, while section header at the
end. So, we need to take which of the following ways:

1. Seek backward to retry writing operation for ELF header
after writing process for a whole part

2. Make offset calculation process and writing process
totally sequential

The clause 1. is not always possible: one cannot assume that file system
supports seek function. Consider the no_llseek case.

Therefore, this patch adopts the clause 2.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1fcccbac 05-Mar-2010 Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>

elf coredump: replace ELF_CORE_EXTRA_* macros by functions

elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
macro for hiding _multiline_ logics in functions. This patch removes
#ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For
architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
order to reduce a range of modification.

This cleanup is for my next patches, but I think this cleanup itself is
worth doing regardless of my firnal purpose.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 088e7af7 05-Mar-2010 Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>

coredump: move dump_write() and dump_seek() into a header file

My next patch will replace ELF_CORE_EXTRA_* macros by functions, putting
them into other newly created *.c files. Then, each files will contain
dump_write(), where each pair of binfmt_*.c and elfcore.c should be the
same. So, this patch moves them into a header file with dump_seek().
Also, the patch deletes confusing DUMP_WRITE macros in each files.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 221af7f8 28-Jan-2010 Linus Torvalds <torvalds@linux-foundation.org>

Split 'flush_old_exec' into two functions

'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.

Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.

As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.

This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f6151dfe 17-Dec-2009 Masami Hiramatsu <mhiramat@redhat.com>

mm: introduce coredump parameter structure

Introduce coredump parameter data structure (struct coredump_params) to
simplify binfmt->core_dump() arguments.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 698ba7b5 15-Dec-2009 Christoph Hellwig <hch@lst.de>

elf: kill USE_ELF_CORE_DUMP

Currently all architectures but microblaze unconditionally define
USE_ELF_CORE_DUMP. The microblaze omission seems like an error to me, so
let's kill this ifdef and make sure we are the same everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Simek <michal.simek@petalogix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# af901ca1 14-Nov-2009 André Goddard Rosa <andre.goddard@gmail.com>

tree-wide: fix assorted typos all over the place

That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 0cf062d0 23-Sep-2009 Amerigo Wang <amwang@redhat.com>

elf: clean up fill_note_info()

Introduce a helper function elf_note_info_init() to help fill_note_info()
to do initializations, also fix the potential memory leaks.

[akpm@linux-foundation.org: remove NUM_NOTES]
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f3e8fccd 21-Sep-2009 Hugh Dickins <hugh.dickins@tiscali.co.uk>

mm: add get_dump_page

In preparation for the next patch, add a simple get_dump_page(addr)
interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
get_user_pages() directly. They're not interested in errors: they
just want to use holes as much as possible, to save space and make
sure that the data is aligned where the headers said it would be.

Oh, and don't use that horrid DUMP_SEEK(off) macro!

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9f0ab4a3 08-Sep-2009 Roland McGrath <roland@redhat.com>

binfmt_elf: fix PT_INTERP bss handling

In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.

Here is a small test case. (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)

----- ptinterp.S
_start: .globl _start
nop
int3
-----
$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
$ ./hello
Segmentation fault # during execve() itself

After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes

If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.

This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.

Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 752015d1 08-Sep-2009 Roland McGrath <roland@redhat.com>

binfmt_elf: fix PT_INTERP bss handling

In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.

Here is a small test case. (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)

----- ptinterp.S
_start: .globl _start
nop
int3
-----
$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
$ ./hello
Segmentation fault # during execve() itself

After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes

If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.

This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.

Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2dbe125 30-Jun-2009 Amerigo Wang <amwang@redhat.com>

elf: fix one check-after-use

Check before use it.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 341c87bf 30-Jun-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

elf: limit max map count to safe value

With ELF, at generating coredump, some more headers other than used
vmas are added.

When max_map_count == 65536, a core generated by following kinds of
code can be unreadable because the number of ELF's program header is
written in 16bit in Ehdr (please see elf.h) and the number overflows.

==
... = mmap(); (munmap, mprotect, etc...)
if (failed)
abort();
==

This can happen in mmap/munmap/mprotect/etc...which calls split_vma().

I think 65536 is not safe as _default_ and reduce it to 65530 is good
for avoiding unexpected corrupted core.

Anyway, max_map_count can be enlarged by sysctl if a user is brave..

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3b34fc58 17-Jun-2009 Oleg Nesterov <oleg@redhat.com>

elf_core_dump: use rcu_read_lock() to access ->real_parent

In theory it is not safe to dereference ->parent/real_parent without
tasklist or rcu lock, we can race with re-parenting.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d0f35dde 29-Mar-2009 Al Viro <viro@zeniv.linux.org.uk>

Trim includes in binfmt_elf

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7b9b550 29-Mar-2009 Al Viro <viro@zeniv.linux.org.uk>

Don't mess with descriptor table in load_elf_binary()

... since we don't tell anyone which descriptor does the file get.
We used to, but only in case of ELF binary with a.out loader and
that stuff has been gone for a while.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 92dc07b1 06-Feb-2009 Roland McGrath <roland@redhat.com>

elf core dump: fix get_user use

The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.

Checking for VM_READ optimizes out the case where get_user() would fail
anyway. The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>


# f06295b4 07-Jan-2009 Kees Cook <keescook@chromium.org>

ELF: implement AT_RANDOM for glibc PRNG seeding

While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled. This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.

[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html

Ulrich said:

glibc needs right after startup a bit of random data for internal
protections (stack canary etc). What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it. For
every process startup. That's slow.

...

The solution is to provide a limited amount of random data to the
starting process in the aux vector. I suggested 16 bytes and this is
what the patch implements. If we need only 16 bytes or less we use the
data directly. If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose. It might not
be the same pool as that for /dev/urandom.

Concerns were expressed about the depletion of the randomness pool. But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc5243d9 25-Dec-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] arch_setup_additional_pages arguments

arch_setup_additional_pages currently gets two arguments, the binary
format descripton and an indication if the process uses an executable
stack or not. The second argument is not used by anybody, it could
be removed without replacement.

What actually does make sense is to pass an indication if the process
uses the elf interpreter or not. The glibc code will not use anything
from the vdso if the process does not use the dynamic linker, so for
statically linked binaries the architecture backend can choose not
to map the vdso.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# a6f76f23 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Make execve() take advantage of copy-on-write credentials

Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.

This patch and the preceding patches have been tested with the LTP SELinux
testsuite.

This patch makes several logical sets of alteration:

(1) execve().

The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.

I would like to replace bprm->cap_effective with:

cap_isclear(bprm->cap_effective)

but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).

The following sequence of events now happens:

(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.

(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.

This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.

(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.

(c) prepare_binprm() is called, possibly multiple times.

(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.

(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.

This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.

(iii) bprm->cred_prepared is set to 1.

bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.

(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:

(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().

(ii) Clear any bits in current->personality that were deferred from
(c.i).

(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:

(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.

This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).

(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.

(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.

(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.

(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.

(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.

(2) LSM interface.

A number of functions have been changed, added or removed:

(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()

Removed in favour of preparing new credentials and modifying those.

(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()

Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().

(*) security_bprm_set(), ->bprm_set_security()

Removed; folded into security_bprm_set_creds().

(*) security_bprm_set_creds(), ->bprm_set_creds()

New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.

(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()

New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.

The former may access bprm->cred, the latter may not.

(3) SELinux.

SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:

(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.

(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# c69e8d9c 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Use RCU to access another task's creds and to release a task's own creds

Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 86a264ab 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap current->cred and a few other accessors

Wrap current->cred and a few other accessors to hide their actual
implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b6dff3ec 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# e575f111 18-Oct-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

coredump_filter: add hugepage dumping

Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped. But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g. vmalloc, sound related).

Thus hugepages are never dumped and it can't be debugged easily. Many
developers want hugepages to be included into core-dump.

However, We can't read generic VM_RESERVED area because this area is often
IO mapping area. then these area reading may change device state. it is
definitly undesiable side-effect.

So adding a hugepage specific bit to the coredump filter is better. It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.

In additional, libhugetlb use hugetlb private mapping pages as anonymous
page. Then, hugepage private mapping pages should be core dumped by
default.

Then, /proc/[pid]/core_dump_filter has two new bits.

- bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
- bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no)

I tested by following method.

% ulimit -c unlimited
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#include "hugetlbfs.h"

int main(int argc, char** argv){
char* p;
int ch;
int mmap_flags = MAP_SHARED;
int fd;
int nr_pages;

while((ch = getopt(argc, argv, "p")) != -1) {
switch (ch) {
case 'p':
mmap_flags &= ~MAP_SHARED;
mmap_flags |= MAP_PRIVATE;
break;
default:
/* nothing*/
break;
}
}
argc -= optind;
argv += optind;

if (argc == 0){
printf("need # of pages\n");
exit(1);
}

nr_pages = atoi(argv[0]);
if (nr_pages < 2) {
printf("nr_pages must >2\n");
exit(1);
}

fd = hugetlbfs_unlinked_fd();
p = mmap(NULL, nr_pages * gethugepagesize(),
PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

sleep(2);

*(p + gethugepagesize()) = 1; /* COW */
sleep(2);

/* crash! */
*(int*)0 = 1;

return 0;
}

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b592682 16-Oct-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[PATCH] remove unused ibcs2/PER_SVR4 in SET_PERSONALITY

The SET_PERSONALITY macro is always called with a second argument of 0.
Remove the ibcs argument and the various tests to set the PER_SVR4
personality.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# f06febc9 12-Sep-2008 Frank Mayhar <fmayhar@google.com>

timers: fix itimer/many thread hang

Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

struct thread_group_cputime {
struct task_cputime totals;
};

struct thread_group_cputime {
struct task_cputime *totals;
};

We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers). The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends. In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention). For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu(). The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().

We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel. The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields. The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures. The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated. The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU. Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.

Non-SMP operation is trivial and will not be mentioned further.

The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().

All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.

Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away. All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline. When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.

Performance

The fix appears not to add significant overhead to existing operations. It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below). Overall it's a wash except in those
two cases.

I've since done somewhat more involved testing on a dual-core Opteron system.

Case 1: With no itimer running, for a test with 100,000 threads, the fixed
kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
all of which was spent in the system. There were twice as many
voluntary context switches with the fix as without it.

Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
an unmodified kernel can handle), the fixed kernel ran the test in
eight percent of the time (5.8 seconds as opposed to 70 seconds) and
had better tick accuracy (.012 seconds per tick as opposed to .023
seconds per tick).

Case 3: A 4000-thread test with an initial timer tick of .01 second and an
interval of 10,000 seconds (i.e. a timer that ticks only once) had
very nearly the same performance in both cases: 6.3 seconds elapsed
for the fixed kernel versus 5.5 seconds for the unfixed kernel.

With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.

Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
running), the modified kernel was very slightly favored in that while
it killed the process in 19.997 seconds of CPU time (5.002 seconds of
wall time), only .003 seconds of that was system time, the rest was
user time. The unmodified kernel killed the process in 20.001 seconds
of CPU (5.014 seconds of wall time) of which .016 seconds was system
time. Really, though, the results were too close to call. The results
were essentially the same with no itimer running.

Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
(where the hard limit would never be reached) and an itimer running,
the modified kernel exhibited worse tick accuracy than the unmodified
kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
performance was almost indistinguishable. With no itimer running this
test exhibited virtually identical behavior and times in both cases.

In times past I did some limited performance testing. those results are below.

On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds. Performance with eight, four and one
thread were comparable. Interestingly, the timer ticks with the fix seemed
more accurate: The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick. Both cases were configured for an interval of
0.01 seconds. Again, the other tests were comparable. Each thread in this
test computed the primes up to 25,000,000.

I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix. In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable). System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
accurate. There is obviously no comparable test without the fix.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 6341c393 25-Jul-2008 Roland McGrath <roland@redhat.com>

tracehook: exec

This moves all the ptrace hooks related to exec into tracehook.h inlines.

This also lifts the calls for tracing out of the binfmt load_binary hooks
into search_binary_handler() after it calls into the binfmt module. This
change has no effect, since all the binfmt modules' load_binary functions
did the call at the end on success, and now search_binary_handler() does
it immediately after return if successful. We consolidate the repeated
code, and binfmt modules no longer need to import ptrace_notify().

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 83914441 25-Jul-2008 Oleg Nesterov <oleg@tv-sign.ru>

coredump: elf_core_dump: use core_state->dumper list

Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.

This patch allows futher cleanups in binfmt_elf.c, in particular we can
kill the parallel info->threads list.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 24d5288f 25-Jul-2008 Oleg Nesterov <oleg@tv-sign.ru>

coredump: elf_core_dump: skip kernel threads

linux_binfmt->core_dump() runs before the process does exit_aio(), this
means that we can hit the kernel thread which shares the same ->mm.
Afaics, nothing really bad can happen, but perhaps it makes sense to fix
this minor bug.

It is sad we have to iterate over all threads in system and use
GFP_ATOMIC. Hopefully we can kill theses ugly do_each_thread()s, but this
needs some nontrivial changes in mm_struct and do_coredump.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 483fad1c 21-Jul-2008 Nathan Lynch <ntl@pobox.com>

ELF loader support for auxvec base platform string

Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware. For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+". This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.

However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model. A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware. For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6". The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.

If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 65191087 21-Jul-2008 John Reiser <jreiser@BitWagon.com>

execve filename: document and export via auxiliary vector

The Linux kernel puts the filename argument of execve() into the new
address space. Many developers are surprised to learn this. Those who
know and could use it, object "But it's not documented."

Those who want to use it dislike the expression
(char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.

This patch documents the insertion of the filename, and makes it easier
to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.

In many cases readlink("/proc/self/exe",) gives the same answer. But if
all the original pages get unmapped, then the kernel erases the symlink
for /proc/self/exe. This can happen when a program decompressor does a
good job of cleaning up after uncompressing directly to memory, so that
the address space of the target program looks the same as if compression
had never happened. One example is http://upx.sourceforge.net .

One notable use of the underlying concept (what path containED the
executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for
the near term, it may be a good idea for user-mode code to use both
/proc/self/exe and AT_EXECFN as fall-back methods for each other.
/proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val
also can get overwritten, although in nearly all cases this would be the
result of a bug.

The runtime cost is one NEW_AUX_ENT using two words of stack space. The
underlying value is maintained already as bprm->exec; setup_arg_pages()
in fs/exec.c slides it for stack_shift, etc.

Signed-off-by: John Reiser <jreiser@BitWagon.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9e0f529 15-Jun-2008 David Woodhouse <dwmw2@infradead.org>

Remove last traces of a.out support from ELF loader.

In commit d20894a23708c2af75966534f8e4dedb46d48db2 ("Remove a.out
interpreter support in ELF loader"), Andi removed support for a.out
interpreters from the ELF loader, which was only ever needed for the
transition from a.out to ELF.

This removes the last traces of that support, in particular the
inclusion of <linux/a.out.h>.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23c4971e 08-May-2008 WANG Cong <xiyou.wangcong@gmail.com>

[Patch] fs/binfmt_elf.c: fix wrong return values

create_elf_tables() returns 0 on success. But when strnlen_user() "fails",
it returns 0 directly. So this is wrong.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5f719558 05-May-2008 WANG Cong <xiyou.wangcong@gmail.com>

[Patch] fs/binfmt_elf.c: fix a wrong free

In kmalloc failing path, we shouldn't free pointers in 'info',
because the struct 'info' is uninitilized when kmalloc is called.

And when kmalloc returns NULL, it's needless to kfree it.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

--
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4220b7fe 29-Apr-2008 WANG Cong <xiyou.wangcong@gmail.com>

elf: fix shadowed variables in fs/binfmt_elf.c

Fix these sparse warings:
fs/binfmt_elf.c:1749:29: warning: symbol 'tmp' shadows an earlier one
fs/binfmt_elf.c:1734:28: originally declared here
fs/binfmt_elf.c:2009:26: warning: symbol 'vma' shadows an earlier one
fs/binfmt_elf.c:1892:24: originally declared here

[akpm@linux-foundation.org: chose better variable name]
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6970c8ef 29-Apr-2008 Cyrill Gorcunov <gorcunov@gmail.com>

BINFMT: fill_elf_header cleanup - use straight memset first

This patch does simplify fill_elf_header function by setting
to zero the whole elf header first. So we fillup the fields
we really need only.

before:
text data bss dec hex filename
11735 80 0 11815 2e27 fs/binfmt_elf.o

after:
text data bss dec hex filename
11710 80 0 11790 2e0e fs/binfmt_elf.o

viola, 25 bytes of text is freed

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd8328be 22-Apr-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] sanitize handling of shared descriptor tables in failing execve()

* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.

As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d31472b6 04-Mar-2008 Roland McGrath <roland@redhat.com>

core dump: user_regset writeback

This makes the user_regset-based core dump code call user_regset writeback
hooks when available. This is necessary groundwork to allow IA64 to set
CORE_DUMP_USE_REGSET.

Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d20894a2 08-Feb-2008 Andi Kleen <andi@firstfloor.org>

Remove a.out interpreter support in ELF loader

Following the deprecation schedule the a.out ELF interpreter support
is removed now with this patch. a.out ELF interpreters were an transition
feature for moving a.out systems to ELF, but they're unlikely to be still
needed. Pure a.out systems will still work of course. This allows to
simplify the hairy ELF loader.

Signed-off-by: Andi Kleen <ak@suse.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7fa30315 08-Feb-2008 David Howells <dhowells@redhat.com>

aout: suppress A.OUT library support if !CONFIG_ARCH_SUPPORTS_AOUT

Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.

Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
be permitted to go looking for A.OUT libraries to load in such a case. Not
only that, but under such conditions A.OUT core dumps are not produced either.

To make this work, this patch also does the following:

(1) Makes the existence of the contents of linux/a.out.h contingent on
CONFIG_ARCH_SUPPORTS_AOUT.

(2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
core dumping code.

(3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
is then included only where needed. This means that this bit of arch
code will be stored in the appropriate A.OUT binfmt module rather than
the core kernel.

(4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
needed) and FRV.

This patch depends on the previous patch to move STACK_TOP[_MAX] out of
asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
format is available.

[jdike@addtoit.com: uml: re-remove accidentally restored code]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32a93233 06-Feb-2008 Ingo Molnar <mingo@elte.hu>

brk randomization: introduce CONFIG_COMPAT_BRK

based on similar patch from: Pavel Machek <pavel@ucw.cz>

Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free
(but not obliged to) randomize the brk area.

Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
enabled by default.

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 09c6dd3c 03-Feb-2008 Ohad Ben-Cohen <ohad@bencohen.org>

fs/binfmt_elf.c: spello fix

s/litle/little

Signed-off-by: Ohad Ben-Cohen <ohad@bencohen.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>


# 612a95b4 30-Jan-2008 Andi Kleen <andi@firstfloor.org>

x86: remove iBCS support

ibcs2 support has never been supported on 2.6 kernels as far as I know,
and if it has it must have been an external patch. Anyways, if anybody
applies an external patch they could as well readd the ibcs checking
code to the ELF loader in the same patch. But there is no reason to
keep this code running in all Linux kernels. This will save at least
two strcmps each ELF execution.

No deprecation period because it could not have been used anyway.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 4206d3aa 30-Jan-2008 Roland McGrath <roland@redhat.com>

elf core dump: notes user_regset

This modifies the ELF core dump code under #ifdef CORE_DUMP_USE_REGSET.
It changes nothing when this macro is not defined. When it's #define'd
by some arch header (e.g. asm/elf.h), the arch must support the
user_regset (linux/regset.h) interface for reading thread state.

This provides an alternate version of note segment writing that is based
purely on the user_regset interfaces. When CORE_DUMP_USE_REGSET is set,
the arch need not define macros such as ELF_CORE_COPY_REGS and ELF_ARCH.
All that information is taken from the user_regset data structures.
The core dumps come out exactly the same if arch's definitions for its
user_regset details are correct.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 3aba481f 30-Jan-2008 Roland McGrath <roland@redhat.com>

elf core dump: notes reorg

This pulls out the code for writing the notes segment of an ELF core dump
into separate functions. This cleanly isolates into one cluster of
functions everything that deals with the note formats and the hooks into
arch code to fill them. The top-level elf_core_dump function itself now
deals purely with the generic ELF format and the memory segments.

This only moves code around into functions that can be inlined away.
It should not change any behavior at all.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# bb1ad820 30-Jan-2008 Andrew Morton <akpm@linux-foundation.org>

x86: PIE executable randomization, checkpatch fixes

#39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
+elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)

WARNING: no space between function name and open parenthesis '('
#39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
+elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)

WARNING: line over 80 characters
#67: FILE: arch/x86/kernel/sys_x86_64.c:80:
+ new_begin = randomize_range(*begin, *begin + 0x02000000, 0);

ERROR: use tabs not spaces
#110: FILE: arch/x86/kernel/sys_x86_64.c:185:
+ ^I mm->cached_hole_size = 0;$

ERROR: use tabs not spaces
#111: FILE: arch/x86/kernel/sys_x86_64.c:186:
+ ^I^Imm->free_area_cache = mm->mmap_base;$

ERROR: use tabs not spaces
#112: FILE: arch/x86/kernel/sys_x86_64.c:187:
+ ^I}$

ERROR: use tabs not spaces
#141: FILE: arch/x86/kernel/sys_x86_64.c:216:
+ ^I^I/* remember the largest hole we saw so far */$

ERROR: use tabs not spaces
#142: FILE: arch/x86/kernel/sys_x86_64.c:217:
+ ^I^Iif (addr + mm->cached_hole_size < vma->vm_start)$

ERROR: use tabs not spaces
#143: FILE: arch/x86/kernel/sys_x86_64.c:218:
+ ^I^I mm->cached_hole_size = vma->vm_start - addr;$

ERROR: use tabs not spaces
#157: FILE: arch/x86/kernel/sys_x86_64.c:232:
+ ^Imm->free_area_cache = TASK_UNMAPPED_BASE;$

ERROR: need a space before the open parenthesis '('
#291: FILE: arch/x86/mm/mmap_64.c:101:
+ } else if(mmap_is_legacy()) {

WARNING: braces {} are not necessary for single statement blocks
#302: FILE: arch/x86/mm/mmap_64.c:112:
+ if (current->flags & PF_RANDOMIZE) {
+ mm->mmap_base += ((long)rnd) << PAGE_SHIFT;
+ }

WARNING: line over 80 characters
#314: FILE: fs/binfmt_elf.c:48:
+static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);

WARNING: no space between function name and open parenthesis '('
#314: FILE: fs/binfmt_elf.c:48:
+static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);

WARNING: line over 80 characters
#429: FILE: fs/binfmt_elf.c:438:
+ eppnt, elf_prot, elf_type, total_size);

ERROR: need space after that ',' (ctx:VxV)
#480: FILE: fs/binfmt_elf.c:939:
+ elf_prot, elf_flags,0);
^

total: 9 errors, 7 warnings, 461 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# cc503c1b 30-Jan-2008 Jiri Kosina <jkosina@suse.cz>

x86: PIE executable randomization

main executable of (specially compiled/linked -pie/-fpie) ET_DYN binaries
onto a random address (in cases in which mmap() is allowed to perform a
randomization).

The code has been extraced from Ingo's exec-shield patch
http://people.redhat.com/mingo/exec-shield/

[akpm@linux-foundation.org: fix used-uninitialsied warning]
[kamezawa.hiroyu@jp.fujitsu.com: fixed ia32 ELF on x86_64 handling]

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# c1d171a0 30-Jan-2008 Jiri Kosina <jkosina@suse.cz>

x86: randomize brk

Randomize the location of the heap (brk) for i386 and x86_64. The range is
randomized in the range starting at current brk location up to 0x02000000
offset for both architectures. This, together with
pie-executable-randomization.patch and
pie-executable-randomization-fix.patch, should make the address space
randomization on i386 and x86_64 complete.

Arjan says:

This is known to break older versions of some emacs variants, whose dumper
code assumed that the last variable declared in the program is equal to the
start of the dynamically allocated memory region.

(The dumper is the code where emacs effectively dumps core at the end of it's
compilation stage; this coredump is then loaded as the main program during
normal use)

iirc this was 5 years or so; we found this way back when I was at RH and we
first did the security stuff there (including this brk randomization). It
wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
that emacs already had code to deal with it for other archs/oses, just
ifdeffed wrongly).

It's a rare and wrong assumption as a general thing, just on x86 it mostly
happened to be true (but to be honest, it'll break too if gcc does
something fancy or if the linker does a non-standard order). Still its
something we should at least document.

Note 2: afaik it only broke the emacs *build*. I'm not 100% sure about that
(it IS 5 years ago) though.

[ akpm@linux-foundation.org: deuglification ]

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 45626bb2 07-Jan-2008 Roland McGrath <roland@redhat.com>

core dump: real_parent ppid

The pr_ppid field reported in core dumps should match what
getppid() would have returned to that process, regardless of
whether a debugger is attached.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b488893a 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org>

pid namespaces: changes to show virtual ids to user

This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a47afb0f 19-Oct-2007 Pavel Emelianov <xemul@openvz.org>

pid namespaces: round up the API

The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.

The proposals are to
* equip the functions that return the integer with _nr suffix to
represent that fact,
* and to make all functions work with task (not process) by making
the common prefix of the same name.

For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d68c9d6a 17-Oct-2007 Franck Bui-Huu <fbuihuu@gmail.com>

Break ELF_PLATFORM and stack pointer randomization dependency

Currently arch_align_stack() is used by fs/binfmt_elf.c to randomize
stack pointer inside a page. But this happens only if ELF_PLATFORM
symbol is defined.

ELF_PLATFORM is normally set if the architecture wants ld.so to load
implementation specific libraries for optimization. And currently a
lot of architectures just yield this symbol to NULL.

This is the case for MIPS architecture where ELF_PLATFORM is NULL but
arch_align_stack() has been redefined to do stack inside page
randomization. So in this case no randomization is actually done.

This patch breaks this dependency which seems to be useless and allows
platforms such MIPS to do the randomization.

Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4f9a58d7 17-Oct-2007 Olaf Hering <olaf@aepfle.de>

increase AT_VECTOR_SIZE to terminate saved_auxv properly

include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO. fs/binfmt_elf.c
has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
entries. So in the worst case, saved_auxv does not get an AT_NULL entry at
the end.

The saved_auxv array must be terminated with an AT_NULL entry. Make the
size of mm_struct->saved_auxv arch dependend, based on the number of
ARCH_DLINFO entries.

Signed-off-by: Olaf Hering <olh@suse.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 82df3973 17-Oct-2007 Roland McGrath <roland@redhat.com>

Add MMF_DUMP_ELF_HEADERS

This adds the MMF_DUMP_ELF_HEADERS option to /proc/pid/coredump_filter.
This dumps the first page (only) of a private file mapping if it appears to
be a mapping of an ELF file. Including these pages in the core dump may
give sufficient identifying information to associate the original DSO and
executable file images and their debugging information with a core file in
a generic way just from its contents (e.g. when those binaries were built
with ld --build-id). I expect this to become the default behavior
eventually. Existing versions of gdb can be confused by the core dumps it
creates, so it won't enabled by default for some time to come. Soon many
people will have systems with a gdb that handle these dumps, so they can
arrange to set the bit at boot and have it inherited system-wide.

This also cleans up the checking of the MMF_DUMP_* flag bits, which did not
need to be using atomic macros.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e9073ed 17-Oct-2007 Andi Kleen <ak@novell.com>

Deprecate a.out ELF interpreters

The Linux ELF loader is quite complicated and messy code (that could
probably need a rewrite, but that's a different chapter). One particular
messy part in it is the support for non ELF a.out ld.sos. This was
originally added to make transition from a.out to ELF easier because an
a.out ELF ld.so could be still build using an older a.out toolkit. But by
now that should be fully obsolete and removing it would clean up
binfmt_elf.c up a bit.

I propose to deprecate this support and remove for 2.6.25.

Drawback is that someone still runs their system with a.out ld.so
they would need to update the ld.so when updating to a new kernel.

This patch just adds an entry to the deprecation file and a printk
warning users.

[akpm@linux-foundation.org: better warning message]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7dc0b22e 17-Oct-2007 Neil Horman <nhorman@tuxdriver.com>

core_pattern: ignore RLIMIT_CORE if core_pattern is a pipe

For some time /proc/sys/kernel/core_pattern has been able to set its output
destination as a pipe, allowing a user space helper to receive and
intellegently process a core. This infrastructure however has some
shortcommings which can be enhanced. Specifically:

1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
when core_pattern is a pipe, since file system resources are not being
consumed in this case, unless the user application wishes to save the core,
at which point the app is restricted by usual file system limits and
restrictions.

2) The core_pattern code should be able to parse and pass options to the
user space helper as an argv array. The real core limit of the uid of the
crashing proces should also be passable to the user space helper (since it
is overridden to zero when called).

3) Some miscellaneous bugs need to be cleaned up (specifically the
recognition of a recursive core dump, should the user mode helper itself
crash. Also, the core dump code in the kernel should not wait for the user
mode helper to exit, since the same context is responsible for writing to
the pipe, and a read of the pipe by the user mode helper will result in a
deadlock.

This patch:

Remove the check of RLIMIT_CORE if core_pattern is a pipe. In the event that
core_pattern is a pipe, the entire core will be fed to the user mode helper.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: <martin.pitt@ubuntu.com>
Cc: <wwoods@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b20cd80 17-Oct-2007 Mark Nelson <markn@au1.ibm.com>

x86: replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE #define

Replace NT_PRXFPREG with ELF_CORE_XFPREG_TYPE in the coredump code which
allows for more flexibility in the note type for the state of 'extended
floating point' implementations in coredumps. New note types can now be
added with an appropriate #define.

This does #define ELF_CORE_XFPREG_TYPE to be NT_PRXFPREG in all
current users so there's are no change in behaviour.

This will let us use different note types on powerpc for the Altivec/VMX
state that some PowerPC cpus have (G4, PPC970, POWER6) and for the SPE
(signal processing extension) state that some embedded PowerPC cpus from
Freescale have.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 557ed1fa 16-Oct-2007 Nick Piggin <npiggin@suse.de>

remove ZERO_PAGE

The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
(and thus mapcounted and count towards shared rss). These writes to
the struct page could cause excessive cacheline bouncing on big
systems. There are a number of ways this could be addressed if it is
an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>


# e5501492 18-Sep-2007 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] spufs: Cleanup ELF coredump extra notes logic

To start with, arch_notes_size() etc. is a little too ambiguous a name for
my liking, so change the function names to be more explicit.

Calling through macros is ugly, especially with hidden parameters, so don't
do that, call the routines directly.

Use ARCH_HAVE_EXTRA_ELF_NOTES as the only flag, and based on it decide
whether we want the extern declarations or the empty versions.

Since we have empty routines, actually use them in the coredump code to
save a few #ifdefs.

We want to change the handling of foffset so that the write routine updates
foffset as it goes, instead of using file->f_pos (so that writing to a pipe
works). So pass foffset to the write routine, and for now just set it to
file->f_pos at the end of writing.

It should also be possible for the write routine to fail, so change it to
return int and treat a non-zero return as failure.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d4e3cc38 21-Jul-2007 Andrew Morton <akpm@linux-foundation.org>

revert "PIE randomization"

There are reports of this causing userspace failures
(http://lkml.org/lkml/2007/7/20/421).

Revert.

Cc: Jan Kratochvil <honza@jikos.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Kunitz <kune@deine-taler.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Bret Towe" <magnade@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1b59e80 19-Jul-2007 Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com>

coredump masking: ELF: enable core dump filtering

This patch enables core dump filtering for ELF-formatted core file.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b6a2fea3 19-Jul-2007 Ollie Wild <aaw@google.com>

mm: variable length argument support

Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.

We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space. Once the binfmt code runs and figures out
where the stack should be, we move it downwards.

It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.

[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: Ollie Wild <aaw@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4d3b573a 16-Jul-2007 Andrew Morton <akpm@linux-foundation.org>

binfmt_elf warning fix

fs/binfmt_elf.c: In function 'load_elf_binary':
fs/binfmt_elf.c:1002: warning: 'interp_map_addr' may be used uninitialized in this function

The compiler (gcc-4.1.0) is correct, but it failed to notice that we didn't
use the resulting value.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60bfba7e 16-Jul-2007 Jan Kratochvil <honza@jikos.cz>

PIE randomization

This patch is using mmap()'s randomization functionality in such a way that
it maps the main executable of (specially compiled/linked -pie/-fpie)
ET_DYN binaries onto a random address (in cases in which mmap() is allowed
to perform a randomization).

Origin of this patch is in exec-shield
(http://people.redhat.com/mingo/exec-shield/)

[jkosina@suse.cz: pie randomization: fix BAD_ADDR macro]
Signed-off-by: Jan Kratochvil <honza@jikos.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef7320ed 06-Jul-2007 Michael Ellerman <michael@ellerman.id.au>

Fix elf_core_dump() when writing arch specific notes (spu coredumps)

elf_core_dump() supports dumping arch specific ELF notes, via the #define
ELF_CORE_WRITE_EXTRA_NOTES. Currently the only user of this is the powerpc
spu coredump code.

There is a bug in the handling of foffset WRT the arch notes, which causes
us to erroneously increment foffset by the size of the arch notes, leaving
a block of zeroes in the file, and causing all subsequent data in the file
to be at <supposed position> + <arch note size>. eg:

LOAD 0x050000 0x00100000 0x00000000 0x20000 0x20000 R E 0x10000

Tells us we should have a chunk of data at 0x50000. The truth is the data
is at 0x90dbc = 0x50000 + 0x40dbc (the size of the arch notes).

This bug prevents gdb from reading the core file correctly.

The simplest fix is to simply remember the size of the arch notes, and add
it to foffset after we've written the arch notes. The only drawback is
that if the arch code doesn't write as many bytes as it said it would, we
end up with a broken core dump again. For now I think that's a reasonable
requirement.

Tested on a Cell blade, gdb no longer complains about the core file being
bogus.

While I'm here I should point out that the spu coredump code does not work
if we're dumping to a pipe - we'll have to wait for 23 to fix that.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b140f251 08-May-2007 Alexey Kuznetsov <alexey@openvz.org>

Invalid return value of execve() resulting in oopses

When elf loader fails to map executable (due to memory shortage or because
binary is malformed), it can return 0. Normally, this is invisible because
process is killed with SIGKILL and it never returns to user space.

But if exec() is called from kernel thread (hotplug, whatever)
consequences are more interesting and vary depending on architecture.

i386. Nothing especially interesting, execve() just returns
with "success" :-)

x86_64. Fake zero frame is used on way to caller, RSP/RIP are loaded
with zeros, ergo... double fault.

ia64. Similar to i386, but r32...r95 are corrupted. Sometimes it
oopses due to return to zero PC, sometimes it sees NaT in
rXX and oopses due to NaT consumption.

Signed-off-by: Alexey Kuznetsov <alexey@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e80d0d0 08-May-2007 Alexey Dobriyan <adobriyan@gmail.com>

i386: sched.h inclusion from module.h is baack

linux/module.h
-> linux/elf.h
-> asm-i386/elf.h
-> linux/utsname.h
-> linux/sched.h

Noticeably cut the number of files which are rebuild upon touching sched.h
and cut down pulled junk from every module.h inclusion.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e63340ae 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com>

header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03221702 02-Apr-2007 Brian Pomerantz <bapper@piratehaven.org>

[PATCH] fix page leak during core dump

When the dump cannot occur most likely because of a full file system and
the page to be written is the zero page, the call to page_cache_release()
is missed.

Signed-off-by: Brian Pomerantz <bapper@mvista.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d1cabd63 16-Mar-2007 James Bottomley <James.Bottomley@SteelEye.com>

[PATCH] fix process crash caused by randomisation and 64k pages

This bug was seen on ppc64, but it could have occurred on any
architecture with a page size of 64k or above. The problem is that in
fs/binfmt_elf.c:randomize_stack_top() randomizes the stack to within
0x7ff pages. On 4k page machines, this is 8MB; on 64k page boxes, this
is 128MB.

The problem is that the new binary layout (selected in
arch_pick_mmap_layout) places the mapping segment 128MB or the stack
rlimit away from the top of the process memory, whichever is larger. If
you chose an rlimit of less than 128MB (most defaults are in the 8Mb
range) then you can end up having your entire stack randomized away.

The fix is to make randomize_stack_top() only steal at most 8MB, which this
patch does. However, I have to point out that even with this, your stack
rlimit might not be exactly what you get if it's > 128MB, because you're
still losing the random offset of up to 8MB.

The true fix should be to leave an explicit gap for the randomization plus
a buffer when determining mmap_base, but that would involve fixing all the
architectures.

Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9fbbd4dd 13-Feb-2007 Andi Kleen <ak@linux.intel.com>

[PATCH] x86: Don't require the vDSO for handling a.out signals

and in other strange binfmts. vDSO is not necessarily mapped there.

Signed-off-by: Andi Kleen <ak@suse.de>


# 1fb84496 26-Jan-2007 Alexey Dobriyan <adobriyan@openvz.org>

[PATCH] core-dumping unreadable binaries via PT_INTERP

Proposed patch to fix #5 in
http://www.isec.pl/vulnerabilities/isec-0017-binfmt_elf.txt
aka
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2004-1073

To reproduce, do
* grab poc at the end of advisory.
* add line "eph.p_memsz = 4096;" after "eph.p_filesz = 4096;"
where first "4096" is something equal to or greater than 4096.
* ./poc /usr/bin/sudo && ls -l

Here I get with 2.6.20-rc5:

-rw------- 1 ad ad 102400 2007-01-15 19:17 core
---s--x--x 2 root root 101820 2007-01-15 19:15 /usr/bin/sudo

Check for MAY_READ like binfmt_misc.c does.

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f47aef55 26-Jan-2007 Roland McGrath <roland@redhat.com>

[PATCH] i386 vDSO: use VM_ALWAYSDUMP

This patch fixes core dumps to include the vDSO vma, which is left out now.
It removes the special-case core writing macros, which were not doing the
right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
vma; there is no need for the fixmap page to be installed. It handles the
CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
get_gate_vma after real vmas in the same way the /proc/PID/maps code does.

This changes core dumps so they no longer include the non-PT_LOAD phdrs from
the vDSO. I made the change to add them in the first place, but in turned out
that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
to leave them out, and just let the phdrs inside the vDSO image speak for
themselves.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5b97dde 26-Jan-2007 Roland McGrath <roland@redhat.com>

[PATCH] Add VM_ALWAYSDUMP

This patch adds the VM_ALWAYSDUMP flag for vm_flags in vm_area_struct. This
provides a clean explicit way to have a vma always included in core dumps, as
is needed for vDSO's.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 90cb28e8 06-Jan-2007 Linus Torvalds <torvalds@woody.osdl.org>

Revert "[PATCH] binfmt_elf: randomize PIE binaries (2nd try)"

This reverts commit 59287c0913cc9a6c75712a775f6c1c1ef418ef3b.

Hugh Dickins reports that it causes random failures on x86 with SuSE
10.2, and points out

"Isn't that randomization, anywhere from 0x10000 to ELF_ET_DYN_BASE,
sure to place the ET_DYN from time to time just where the comment
says it's trying to avoid? I assume that somehow results in the error
reported."

(where the comment in question is the existing comment in the source
code about mmap/brk clashes).

Suggested-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Marcus Meissner <meissner@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 937949d9 08-Dec-2006 Cedric Le Goater <clg@fr.ibm.com>

[PATCH] add process_session() helper routine

Replace occurences of task->signal->session by a new process_session() helper
routine.

It will be useful for pid namespaces to abstract the session pid number.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0f7fc9e4 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] VFS: change struct file to use struct path

This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8de61e69 06-Dec-2006 David Rientjes <rientjes@cs.washington.edu>

[PATCH] fs: remove unused variable

Removed unused 'have_pt_gnu_stack' variable.

Reported by David Binderman <dcb314@hotmail.com>

Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 386d9a7e 06-Dec-2006 Magnus Damm <magnus@valinux.co.jp>

[PATCH] elf: Always define elf_addr_t in linux/elf.h

Define elf_addr_t in linux/elf.h. The size of the type is determined using
ELF_CLASS. This allows us to remove the defines that today are spread all
over .c and .h files.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Daniel Jacobowitz <drow@false.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 841d5fb7 06-Dec-2006 Heiko Carstens <hca@linux.ibm.com>

[PATCH] binfmt: fix uaccess handling

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 59287c09 06-Dec-2006 Marcus Meissner <meissner@suse.de>

[PATCH] binfmt_elf: randomize PIE binaries (2nd try)

Randomizes -pie compiled binaries from 64k (0x10000) up to ELF_ET_DYN_BASE.

0 -> 64k is excluded to allow NULL ptr accesses to fail.

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# bf1ab978 22-Nov-2006 Dwayne Grant McConnell <decimal@us.ibm.com>

[POWERPC] coredump: Add SPU elf notes to coredump.

This patch adds SPU elf notes to the coredump. It creates a separate note
for each of /regs, /fpcr, /lslr, /decr, /decr_status, /mem, /signal1,
/signal1_type, /signal2, /signal2_type, /event_mask, /event_status,
/mbox_info, /ibox_info, /wbox_info, /dma_info, /proxydma_info, /object-id.

A new macro, ARCH_HAVE_EXTRA_NOTES, was created for architectures to
specify they have extra elf core notes.

A new macro, ELF_CORE_EXTRA_NOTES_SIZE, was created so the size of the
additional notes could be calculated and added to the notes phdr entry.

A new macro, ELF_CORE_WRITE_EXTRA_NOTES, was created so the new notes
would be written after the existing notes.

The SPU coredump code resides in spufs. Stub functions are provided in the
kernel which are hooked into the spufs code which does the actual work via
register_arch_coredump_calls().

A new set of __spufs_<file>_read/get() functions was provided to allow the
coredump code to read from the spufs files without having to lock the
SPU context for each file read from.

Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Dwayne Grant McConnell <decimal@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>


# a7a0d86f 13-Oct-2006 Petr Vandrovec <petr@vandrovec.name>

[PATCH] Fix core files so they make sense to gdb...

It is silly to use non-static variable for writting zeroes to the file.

And more seriously, foffset in core dump file dump function was incremented
too much, so some parts of core dump were shifted by size of few phdrs and
notes down, so although gdb was able to load that file, it did not make lot
of sense - in my test case data pages were shifted down by about 900 bytes.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7f14daa1 12-Oct-2006 Petr Vandrovec <petr@vandrovec.name>

[PATCH] Get core dump code to work...

The file based core dump code was broken by pipe changes - a relative
llseek returns the absolute file position on success, not the relative
one, so dump_seek() always failed when invoked with non-zero current
position.

Only success/failure can be tested with relative lseek, we have to trust
kernel that on success we've got right file offset. With this fix in
place I have finally real core files instead of 1KB fragments...

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
[ Cleaned it up a bit while here - use SEEK_CUR instead of hardcoding 1 ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d025c9db 01-Oct-2006 Andi Kleen <ak@linux.intel.com>

[PATCH] Support piping into commands in /proc/sys/kernel/core_pattern

Using the infrastructure created in previous patches implement support to
pipe core dumps into programs.

This is done by overloading the existing core_pattern sysctl
with a new syntax:

|program

When the first character of the pattern is a '|' the kernel will instead
threat the rest of the pattern as a command to run. The core dump will be
written to the standard input of that program instead of to a file.

This is useful for having automatic core dump analysis without filling up
disks. The program can do some simple analysis and save only a summary of
the core dump.

The core dump proces will run with the privileges and in the name space of
the process that caused the core dump.

I also increased the core pattern size to 128 bytes so that longer command
lines fit.

Most of the changes comes from allowing core dumps without seeks. They are
fairly straight forward though.

One small incompatibility is that if someone had a core pattern previously
that started with '|' they will get suddenly new behaviour. I think that's
unlikely to be a real problem though.

Additional background:

> Very nice, do you happen to have a program that can accept this kind of
> input for crash dumps? I'm guessing that the embedded people will
> really want this functionality.

I had a cheesy demo/prototype. Basically it wrote the dump to a file again,
ran gdb on it to get a backtrace and wrote the summary to a shared directory.
Then there was a simple CGI script to generate a "top 10" crashes HTML
listing.

Unfortunately this still had the disadvantage to needing full disk space for a
dump except for deleting it afterwards (in fact it was worse because over the
pipe holes didn't work so if you have a holey address map it would require
more space).

Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
cores (at least it worked with zsh's =(cat core) syntax), so it would be
likely possible to do it without temporary space with a simple wrapper that
calls it in the right way. I ran out of time before doing that though.

The demo prototype scripts weren't very good. If there is really interest I
can dig them out (they are currently on a laptop disk on the desk with the
laptop itself being in service), but I would recommend to rewrite them for any
serious application of this and fix the disk space problem.

Also to be really useful it should probably find a way to automatically fetch
the debuginfos (I cheated and just installed them in advance). If nobody else
does it I can probably do the rewrite myself again at some point.

My hope at some point was that desktops would support it in their builtin
crash reporters, but at least the KDE people I talked too seemed to be happy
with their user space only solution.

Alan sayeth:

I don't believe that piping as such as neccessarily the right model, but
the ability to intercept and processes core dumps from user space is asked
for by many enterprise users as well. They want to know about, capture,
analyse and process core dumps, often centrally and in automated form.

[akpm@osdl.org: loff_t != unsigned long]
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 07f3f05c 30-Sep-2006 David Howells <dhowells@redhat.com>

[PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6]

Create a new header file, fs/internal.h, for common definitions local to the
sources in the fs/ directory.

Move extern definitions that should be in header files from fs/*.c to
fs/internal.h or other main header files where they span directories.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 486ccb05 29-Sep-2006 Oleg Nesterov <oleg@tv-sign.ru>

[PATCH] elf_core_dump: don't take tasklist_lock

do_each_thread() is rcu-safe, and all tasks which use this ->mm must sleep
in wait_for_completion(&mm->core_done) at this point, so we can use RCU
locks.

Also, remove unneeded INIT_LIST_HEAD(new) before list_add(new, head).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3b9b8ab6 29-Sep-2006 Kirill Korotaev <dev@sw.ru>

[PATCH] Fix unserialized task->files changing

Fixed race on put_files_struct on exec with proc. Restoring files on
current on error path may lead to proc having a pointer to already kfree-d
files_struct.

->files changing at exit.c and khtread.c are safe as exit_files() makes all
things under lock.

Found during OpenVZ stress testing.

[akpm@osdl.org: add export]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8d6b5eee 26-Sep-2006 Andrew Morton <akpm@osdl.org>

[PATCH] binfmt_elf: consistently use loff_t

As David Howells <dhowells@redhat.com> points out, binfmt_elf sometimes uses
off_t, sometimes uses loff_t. Use loff_t throughout.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c16b63e0 26-Sep-2006 Andi Kleen <ak@linux.intel.com>

[PATCH] i386/x86-64: Don't randomize stack top when no randomization personality is set

Based on patch from Frank van Maarseveen <frankvm@frankvm.com>, but
extended.

Signed-off-by: Andi Kleen <ak@suse.de>


# b4cac1a0 10-Jul-2006 David Howells <dhowells@redhat.com>

[PATCH] FDPIC: Move roundup() into linux/kernel.h

Move the roundup() macro from binfmt_elf.c into linux/kernel.h as it's
generally useful.

[akpm@osdl.org: nuke all the other implementations]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ce51059b 03-Jul-2006 Chuck Ebbert <76306.1226@compuserve.com>

[PATCH] binfmt_elf: fix checks for bad address

Fix check for bad address; use macro instead of open-coding two checks.

Taken from RHEL4 kernel update.

From: Ernie Petrides <petrides@redhat.com>

For background, the BAD_ADDR() macro should return TRUE if the address is
TASK_SIZE, because that's the lowest address that is *not* valid for
user-space mappings. The macro was correct in binfmt_aout.c but was wrong
for the "equal to" case in binfmt_elf.c. There were two in-line validations
of user-space addresses in binfmt_elf.c, which have been appropriately
converted to use the corrected BAD_ADDR() macro in the patch you posted
yesterday. Note that the size checks against TASK_SIZE are okay as coded.

The additional changes that I propose are below. These are in the error
paths for bad ELF entry addresses once load_elf_binary() has already
committed to exec'ing the new image (following the tearing down of the
task's original address space).

The 1st hunk deals with the interp-side of the outer "if". There were two
problems here. The printk() should be removed because this path can be
triggered at will by a bogus interpreter image created and used by a
malicious user. Further, the error code should not be ENOEXEC, because that
causes the loop in search_binary_handler() to continue trying other exec
handlers (twice, in fact). But it's too late for this to work correctly,
because the user address space has already been torn down, and an exec()
failure cannot be returned to the user code because the code no longer
exists. The only recovery is to force a SIGSEGV, but it's best to terminate
the search loop immediately. I somewhat arbitrarily chose EINVAL as a
fallback error code, but any error returned by load_elf_interp() will
override that (but this value will never be seen by user-space).

The 2nd hunk deals with the non-interp-side of the outer "if". There were
two problems here as well. The SIGSEGV needs to be forced, because a prior
sigaction() syscall might have set the associated disposition to SIG_IGN.
And the ENOEXEC should be changed to EINVAL as described above.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Ernie Petrides <petrides@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 785d5570 23-Jun-2006 Jesper Juhl <jesper.juhl@gmail.com>

[PATCH] binflt_elf: remove more casts

Remove redundant casts from NEW_AUX_ENT() arguments in fs/binfmt_elf.c

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f4e5cc2c 23-Jun-2006 Jesper Juhl <jesper.juhl@gmail.com>

[PATCH] binfmt_elf: CodingStyle cleanup and remove some pointless casts

Do a CodingStyle cleanup of fs/binfmt_elf.c and also remove some pointless
casts of kmalloc() return values in the same file.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c89681ed 22-Jun-2006 Miklos Szeredi <miklos@szeredi.hu>

[PATCH] remove steal_locks()

This patch removes the steal_locks() function.

steal_locks() doesn't work correctly with any filesystem that does it's own
lock management, including NFS, CIFS, etc.

In addition it has weird semantics on local filesystems in case tasks
sharing file-descriptor tables are doing POSIX locking operations in
parallel to execve().

The steal_locks() function has an effect on applications doing:

clone(CLONE_FILES)
/* in child */
lock
execve
lock

POSIX locks acquired before execve (by "child", "parent" or any further
task sharing files_struct) will after the execve be owned exclusively by
"child".

According to Chris Wright some LSB/LTP kind of suite triggers without the
stealing behavior, but there's no known real-world application that would
also fail.

Apps using NPTL are not affected, since all other threads are killed before
execve.

Apps using LinuxThreads are only affected if they

- have multiple threads during exec (LinuxThreads doesn't kill other
threads, the app may do it with pthread_kill_other_threads_np())
- rely on POSIX locks being inherited across exec

Both conditions are documented, but not their interaction.

Apps using clone() natively are affected if they

- use clone(CLONE_FILES)
- rely on POSIX locks being inherited across exec

The above scenarios are unlikely, but possible.

If the patch is vetoed, there's a plan B, that involves mostly keeping the
weird stealing semantics, but changing the way lock ownership is handled so
that network and local filesystems work consistently.

That would add more complexity though, so this solution seems to be
preferred by most people.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 913bd906 25-Mar-2006 Andi Kleen <ak@linux.intel.com>

[PATCH] x86_64: Increase the variability of the process stack on 64bit architectures

8MB is not really very random, use 1GB (or more with larger page sizes)
instead.

Also use the low bits of the random generator output now instead of
throwing them away.

Only enabled on x86-64 right now. Other architectures need to add
a suitable STACK_RND_MASK

Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 55148548 25-Mar-2006 Carsten Otte <cotte@de.ibm.com>

[PATCH] remove needless check in binfmt_elf.c

Local variable i is unsigned int and thus cannot be negative.

(akpm: unsigneds shouldn't be called `i'. This value cannot possibly be
negative anyway).

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 11b0b5ab 25-Mar-2006 Oliver Neukum <neukum@fachschaft.cup.uni-muenchen.de>

[PATCH] use kzalloc and kcalloc in core fs code

Signed-off-by: Oliver Neukum <oliver@neukum.name>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5342fba5 25-Feb-2006 Suresh Siddha <suresh.b.siddha@intel.com>

[PATCH] x86_64: Check for bad elf entry address.

Fixes a local DOS on Intel systems that lead to an endless
recursive fault. AMD machines don't seem to be affected.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 858119e1 14-Jan-2006 Arjan van de Ven <arjan@infradead.org>

[PATCH] Unlinline a bunch of other functions

Remove the "inline" keyword from a bunch of big functions in the kernel with
the goal of shrinking it by 30kb to 40kb

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 74da6cd0 10-Jan-2006 Jesper Juhl <juhl-lkml@dif.dk>

missing printk loglevel and tiny tiny whitespace change in binfmt_elf()

Patch adds a mising printk loglevel (I think KERN_WARNING is appropriate
here) in fs/binfmt_elf.c, and while I was there I made some tiny tiny tiny
adjustments to whitespacing in the neighborhood.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 792db3af 09-Jan-2006 Jesper Juhl <jesper.juhl@gmail.com>

[PATCH] fs/binfmt_elf: Remove unneeded kmalloc() return value casts

Remove unneeded casts of kmalloc() return value in binfmt_elf.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 708e9a79 08-Jan-2006 Matt Mackall <mpm@selenic.com>

[PATCH] tiny: Configure ELF core dump support

configurable support for ELF core dumps

text data bss dec hex filename
3330172 529036 190556 4049764 3dcb64 vmlinux-baseline
3325552 528912 190556 4045020 3db8dc vmlinux-no-elf

add/remove: 0/8 grow/shrink: 0/0 up/down: 0/-4424 (-4424)
function old new delta
fill_note 32 - -32
maydump 58 - -58
dump_seek 67 - -67
writenote 180 - -180
elf_dump_thread_status 274 - -274
fill_psinfo 308 - -308
fill_prstatus 466 - -466
elf_core_dump 3039 - -3039

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# dda6ebde 08-Jan-2006 David Gibson <david@gibson.dropbear.id.au>

[PATCH] Fix handling of ELF segments with zero filesize

mmap() returns -EINVAL if given a zero length, and thus elf_map() in
binfmt_elf.c does likewise if it attempts to map a (page-aligned) ELF
segment with zero filesize. Such a situation never arises with the default
linker scripts, but there's nothing inherently wrong with zero-filesize
(but non-zero memsize) ELF segments. Custom linker scripts can generate
them, and the kernel should be able to map them; this patch makes it so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f99d49ad 07-Nov-2005 Jesper Juhl <jesper.juhl@gmail.com>

[PATCH] kfree cleanup: fs

This is the fs/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in fs/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a9289728 30-Oct-2005 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] Don't uselessly export task_struct to userspace in core dumps

task_struct is an internal structure to the kernel with a lot of good
information, that is probably interesting in core dumps. However there is
no way for user space to know what format that information is in making it
useless.

I grepped the GDB 6.3 source code and NT_TASKSTRUCT while defined is not
used anywhere else. So I would be surprised if anyone notices it is
missing.

In addition exporting kernel pointers to all the interesting kernel data
structures sounds like the very definition of an information leak. I
haven't a clue what someone with evil intentions could do with that
information, but in any attack against the kernel it looks like this is the
perfect tool for aiming that attack.

So since NT_TASKSTRUCT is useless as currently defined and is potentially
dangerous, let's just not export it.

(akpm: Daniel Jacobowitz <dan@debian.org> "would be amazed" if anything was
using NT_TASKSTRUCT).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 404351e6 29-Oct-2005 Hugh Dickins <hugh@veritas.com>

[PATCH] mm: mm_init set_mm_counters

How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type. And how is rss
initialized? By set_mm_counter, all over the place. Come on, we just need to
initialize them both at once by set_mm_counter in mm_init (which follows the
memcpy when forking).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6de50517 11-Oct-2005 akpm@osdl.org <akpm@osdl.org>

[PATCH] binfmt_elf bss padding fix

Nir Tzachar <tzachar@cs.bgu.ac.il> points out that if an ELF file specifies a
zero-length bss at a whacky address, we cannot load that binary because
padzero() tries to zero out the end of the page at the whacky address, and
that may not be writeable.

See also http://bugzilla.kernel.org/show_bug.cgi?id=5411

So teach load_elf_binary() to skip the bss settng altogether if the elf file
has a zero-length bss segment.

Cc: Roland McGrath <roland@redhat.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1363c3cd 21-Jun-2005 Wolfgang Wander <wwc@rentec.com>

[PATCH] Avoiding mmap fragmentation

Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

1) the free_area_cache is used to continue a search for memory where
the last search ended. Before the change new areas were always
searched from the base address on.

So now new small areas are cluttering holes of all sizes
throughout the whole mmap-able region whereas before small holes
tended to close holes near the base leaving holes far from the base
large and available for larger requests.

2) the free_area_cache also is set to the location of the last
munmap-ed area so in scenarios where we allocate e.g. five regions of
1K each, then free regions 4 2 3 in this order the next request for 1K
will be placed in the position of the old region 3, whereas before we
appended it to the still active region 1, placing it at the location
of the old region 2. Before we had 1 free region of 2K, now we only
get two free regions of 1K -> fragmentation.

The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache. If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.

The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.

Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5db92850 15-Jun-2005 Daniel Jacobowitz <dan@debian.org>

[PATCH] Fix large core dumps with a 32-bit off_t

The ELF core dump code has one use of off_t when writing out segments.
Some of the segments may be passed the 2GB limit of an off_t, even on a
32-bit system, so it's important to use loff_t instead. This fixes a
corrupted core dump in the bigcore test in GDB's testsuite.

Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a84a5059 11-May-2005 Greg Kroah-Hartman <gregkh@suse.de>

[PATCH] fix Linux kernel ELF core dump privilege elevation

As reported by Paul Starzetz <ihaquer@isec.pl>

Reference: CAN-2005-1263

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


# 18c8baff 28-Apr-2005 Roland McGrath <roland@redhat.com>

[PATCH] Fix error recovery path for arch_setup_additional_pages

If arch_setup_additional_pages fails, the error path will do some double-frees.
This fixes it.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 547ee84c 16-Apr-2005 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[PATCH] ppc64: Improve mapping of vDSO

This patch reworks the way the ppc64 is mapped in user memory by the kernel
to make it more robust against possible collisions with executable
segments. Instead of just whacking a VMA at 1Mb, I now use
get_unmapped_area() with a hint, and I moved the mapping of the vDSO to
after the mapping of the various ELF segments and of the interpreter, so
that conflicts get caught properly (it still has to be before
create_elf_tables since the later will fill the AT_SYSINFO_EHDR with the
proper address).

While I was at it, I also changed the 32 and 64 bits vDSO's to link at
their "natural" address of 1Mb instead of 0. This is the address where
they are normally mapped in absence of conflict. By doing so, it should be
possible to properly prelink one it's been verified to work on glibc.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!