History log of /linux-master/fs/aio.c
Revision Date Author Comments
# 3648951c 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

aio: Convert to migrate_folio

Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 164f4064 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

keep iocb_flags() result cached in struct file

* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal
opens, alloc_file() for pipe()/socket()/etc.)
* update when handling F_SETFL
* keep in a new field - file->f_iocb_flags; since that thing is needed only
before the refcount reaches zero, we can put it into the same anon union
where ->f_rcuhead and ->f_llist live - those are used only after refcount
reaches zero.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 61e02cdb 14-Mar-2022 Lukas Bulwahn <lukas.bulwahn@gmail.com>

aio: drop needless assignment in aio_read()

Commit 84c4e1f89fef ("aio: simplify - and fix - fget/fput for io_submit()")
refactored aio_read() and some error cases into early return, which made
some intermediate assignment of the return variable needless.

Drop this needless assignment in aio_read().

No functional change. No change in resulting object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 41d36a9f 07-Mar-2022 Christoph Hellwig <hch@lst.de>

fs: remove kiocb.ki_hint

This field is entirely unused now except for a tracepoint in f2fs, so
remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 46de8b97 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio

This is a mechanical change.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# 86b12b6c 21-Jan-2022 Xiaoming Ni <nixiaoming@huawei.com>

aio: move aio sysctl to aio.c

The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

Move aio sysctl to aio.c and use the new register_sysctl_init() to
register the sysctl interface for aio.

[mcgrof@kernel.org: adjust commit log to justify the move]

Link: https://lkml.kernel.org/r/20211123202347.818157-9-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4b374986 13-Sep-2021 Xie Yongji <xieyongji@bytedance.com>

aio: Fix incorrect usage of eventfd_signal_allowed()

We should defer eventfd_signal() to the workqueue when
eventfd_signal_allowed() return false rather than return
true.

Fixes: b542e383d8c0 ("eventfd: Make signal recursion protection a task bit")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 50252e4b 08-Dec-2021 Eric Biggers <ebiggers@google.com>

aio: fix use-after-free due to missing POLLFREE handling

signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, by sending a POLLFREE notification to all waiters.

Unfortunately, only eventpoll handles POLLFREE. A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE. This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.

Fix this by making aio poll handle POLLFREE.

A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen. However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order. Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.

The second problem was solved by my previous patch. This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way. It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.

Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 363bee27 08-Dec-2021 Eric Biggers <ebiggers@google.com>

aio: keep poll requests on waitqueue until completed

Currently, aio_poll_wake() will always remove the poll request from the
waitqueue. Then, if aio_poll_complete_work() sees that none of the
polled events are ready and the request isn't cancelled, it re-adds the
request to the waitqueue. (This can easily happen when polling a file
that doesn't pass an event mask when waking up its waitqueue.)

This is fundamentally broken for two reasons:

1. If a wakeup occurs between vfs_poll() and the request being
re-added to the waitqueue, it will be missed because the request
wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL
might never complete even if the polled file is ready.

2. When the request isn't on the waitqueue, there is no way to be
notified that the waitqueue is being freed (which happens when its
lifetime is shorter than the struct file's). This is supposed to
happen via the waitqueue entries being woken up with POLLFREE.

Therefore, leave the requests on the waitqueue until they are actually
completed (or cancelled). To keep track of when aio_poll_complete_work
needs to be scheduled, use new fields in struct poll_iocb. Remove the
'done' field which is now redundant.

Note that this is consistent with how sys_poll() and eventpoll work;
their wakeup functions do *not* remove the waitqueue entries.

Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 6b19b766 21-Oct-2021 Jens Axboe <axboe@kernel.dk>

fs: get rid of the res2 iocb->ki_complete argument

The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6446c4fb 19-Sep-2021 Len Baker <len.baker@gmx.com>

aio: Prefer struct_size over open coded arithmetic

As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + size * count" in the kzalloc() function.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# b542e383 29-Jul-2021 Thomas Gleixner <tglx@linutronix.de>

eventfd: Make signal recursion protection a task bit

The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.

Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.

Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.

No functional change for !RT kernels.

Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx


# 14d07113 29-Apr-2021 Brian Geffon <bgeffon@google.com>

Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd544fd1 14-Dec-2020 Dmitry Safonov <0x7f454c46@gmail.com>

mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio

As kernel expect to see only one of such mappings, any further operations
on the VMA-copy may be unexpected by the kernel. Maybe it's being on the
safe side, but there doesn't seem to be any expected use-case for this, so
restrict it now.

Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a3c84b6 10-Nov-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: separate __sb_start_write into blocking and non-blocking helpers

Break this function into two helpers so that it's obvious that the
trylock versions return a value that must be checked, and the blocking
versions don't require that. While we're at it, clean up the return
type mismatch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# e8f147dc 03-Nov-2020 Thomas Gleixner <tglx@linutronix.de>

fs: Remove asm/kmap_types.h includes

Historical leftovers from the time where kmap() had fixed slots.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20201103095856.870272797@linutronix.de


# 89cd35c5 24-Sep-2020 Christoph Hellwig <hch@lst.de>

iov_iter: transparently handle compat iovecs in import_iovec

Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 45e55300 07-Aug-2020 Peter Collingbourne <pcc@google.com>

mm: remove unnecessary wrapper function do_mmap_pgoff()

The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.

The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().

Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 241cb28e 28-May-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

aio: Replace zero-length array with flexible-array

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 9bf5b9eb 10-Jun-2020 Christoph Hellwig <hch@lst.de>

kernel: move use_mm/unuse_mm to kthread.c

Patch series "improve use_mm / unuse_mm", v2.

This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.

This patch (of 3):

Use the proper API instead.

Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward. Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8ed45c5 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 530f32fc 14-May-2020 Miklos Szeredi <mszeredi@redhat.com>

aio: fix async fsync creds

Avi Kivity reports that on fuse filesystems running in a user namespace
asyncronous fsync fails with EOVERFLOW.

The reason is that f_ops->fsync() is called with the creds of the kthread
performing aio work instead of the creds of the process originally
submitting IOCB_CMD_FSYNC.

Fuse sends the creds of the caller in the request header and it needs to
translate the uid and gid into the server's user namespace. Since the
kthread is running in init_user_ns, the translation will fail and the
operation returns an error.

It can be argued that fsync doesn't actually need any creds, but just
zeroing out those fields in the header (as with requests that currently
don't take creds) is a backward compatibility risk.

Instead of working around this issue in fuse, solve the core of the problem
by calling the filesystem with the proper creds.

Reported-by: Avi Kivity <avi@scylladb.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fixes: c9582eb0ff7d ("fuse: Fail all requests with invalid uids or gids")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 01d7a356 03-Feb-2020 Jens Axboe <axboe@kernel.dk>

aio: prevent potential eventfd recursion on poll

If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 4.19+
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3ca47e95 23-Apr-2019 Arnd Bergmann <arnd@arndb.de>

y2038: remove CONFIG_64BIT_TIME

The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 97eba80f 20-Aug-2019 Guillem Jover <guillem@hadrons.org>

aio: Fix io_pgetevents() struct __compat_aio_sigset layout

This type is used to pass the sigset_t from userland to the kernel,
but it was using the kernel native pointer type for the member
representing the compat userland pointer to the userland sigset_t.

This messes up the layout, and makes the kernel eat up both the
userland pointer and the size members into the kernel pointer, and
then reads garbage into the kernel sigsetsize. Which makes the sigset_t
size consistency check fail, and consequently the syscall always
returns -EINVAL.

This breaks both libaio and strace on 32-bit userland running on 64-bit
kernels. And there are apparently no users in the wild of the current
broken layout (at least according to codesearch.debian.org and a brief
check over github.com search). So it looks safe to fix this directly
in the kernel, instead of either letting userland deal with this
permanently with the additional overhead or trying to make the syscall
infer what layout userland used, even though this is also being worked
around in libaio to temporarily cope with kernels that have not yet
been fixed.

We use a proper compat_uptr_t instead of a compat_sigset_t pointer.

Fixes: 7a074e96dee6 ("aio: implement io_pgetevents")
Signed-off-by: Guillem Jover <guillem@hadrons.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 37109694 18-Jul-2019 Keith Busch <kbusch@kernel.org>

mm: migrate: remove unused mode argument

migrate_page_move_mapping() doesn't use the mode argument. Remove it
and update callers accordingly.

Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b772434b 16-Jul-2019 Oleg Nesterov <oleg@redhat.com>

signal: simplify set_user_sigmask/restore_user_sigmask

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths. This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 97abc889 28-Jun-2019 Oleg Nesterov <oleg@redhat.com>

signal: remove the wrong signal_pending() check in restore_user_sigmask()

This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporary unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.

Eric said:

: For clarity. I don't think this is required by posix, or fundamentally to
: remove the races in select. It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented. The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87e5e6da 14-May-2019 Jens Axboe <axboe@kernel.dk>

uio: make import_iovec()/compat_import_iovec() return bytes on success

Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.

Some callers already treat the return value that way, others need a
slight tweak.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 52db59df 25-Mar-2019 David Howells <dhowells@redhat.com>

vfs: Convert aio to use the new mount API

Convert the aio filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Benjamin LaHaise <bcrl@kvack.org>
cc: linux-aio@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1f58bb18 20-May-2019 Al Viro <viro@zeniv.linux.org.uk>

mount_pseudo(): drop 'name' argument, switch to d_make_root()

Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6af1c849 04-Apr-2019 Wei Yongjun <weiyongjun1@huawei.com>

aio: use kmem_cache_free() instead of kfree()

memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().

Fixes: fa0ca2aee3be ("deal with get_reqs_available() in aio_get_req() itself")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18bfb9c6 03-Apr-2019 Dan Carpenter <dan.carpenter@oracle.com>

aio: Fix an error code in __io_submit_one()

This accidentally returns the wrong variable. The "req->ki_eventfd"
pointer is NULL so this return success.

Fixes: 7316b49c2a11 ("aio: move sanity checks and request allocation to io_submit_one()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7316b49c 06-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

aio: move sanity checks and request allocation to io_submit_one()

makes for somewhat cleaner control flow in __io_submit_one()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fa0ca2ae 06-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

deal with get_reqs_available() in aio_get_req() itself

simplifies the caller

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 74259703 06-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

aio: move dropping ->ki_eventfd into iocb_destroy()

no reason to duplicate that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 958c13ce 06-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

make aio_read()/aio_write() return int

that ssize_t is a rudiment of earlier calling conventions; it's been
used only to pass 0 and -E... since last autumn.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# af5c72b1 07-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

Fix aio_poll() races

aio_poll() has to cope with several unpleasant problems:
* requests that might stay around indefinitely need to
be made visible for io_cancel(2); that must not be done to
a request already completed, though.
* in cases when ->poll() has placed us on a waitqueue,
wakeup might have happened (and request completed) before ->poll()
returns.
* worse, in some early wakeup cases request might end
up re-added into the queue later - we can't treat "woken up and
currently not in the queue" as "it's not going to stick around
indefinitely"
* ... moreover, ->poll() might have decided not to
put it on any queues to start with, and that needs to be distinguished
from the previous case
* ->poll() might have tried to put us on more than one queue.
Only the first will succeed for aio poll, so we might end up missing
wakeups. OTOH, we might very well notice that only after the
wakeup hits and request gets completed (all before ->poll() gets
around to the second poll_wait()). In that case it's too late to
decide that we have an error.

req->woken was an attempt to deal with that. Unfortunately, it was
broken. What we need to keep track of is not that wakeup has happened -
the thing might come back after that. It's that async reference is
already gone and won't come back, so we can't (and needn't) put the
request on the list of cancellables.

The easiest case is "request hadn't been put on any waitqueues"; we
can tell by seeing NULL apt.head, and in that case there won't be
anything async. We should either complete the request ourselves
(if vfs_poll() reports anything of interest) or return an error.

In all other cases we get exclusion with wakeups by grabbing the
queue lock.

If request is currently on queue and we have something interesting
from vfs_poll(), we can steal it and complete the request ourselves.

If it's on queue and vfs_poll() has not reported anything interesting,
we either put it on the cancellable list, or, if we know that it
hadn't been put on all queues ->poll() wanted it on, we steal it and
return an error.

If it's _not_ on queue, it's either been already dealt with (in which
case we do nothing), or there's aio_poll_complete_work() about to be
executed. In that case we either put it on the cancellable list,
or, if we know it hadn't been put on all queues ->poll() wanted it on,
simulate what cancel would've done.

It's a lot more convoluted than I'd like it to be. Single-consumer APIs
suck, and unfortunately aio is not an exception...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2bb874c0 07-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

aio: store event at final iocb_put()

Instead of having aio_complete() set ->ki_res.{res,res2}, do that
explicitly in its callers, drop the reference (as aio_complete()
used to do) and delay the rest until the final iocb_put().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a9339b78 07-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

aio: keep io_event in aio_kiocb

We want to separate forming the resulting io_event from putting it
into the ring buffer.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 833f4154 11-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

aio: fold lookup_kiocb() into its sole caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b53119f1 06-Mar-2019 Linus Torvalds <torvalds@linux-foundation.org>

pin iocb through aio.

aio_poll() is not the only case that needs file pinned; worse, while
aio_read()/aio_write() can live without pinning iocb itself, the
proof is rather brittle and can easily break on later changes.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 84c4e1f8 03-Mar-2019 Linus Torvalds <torvalds@linux-foundation.org>

aio: simplify - and fix - fget/fput for io_submit()

Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:

"In more details - normally IOCB_CMD_POLL handling looks so:

1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()

2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)

3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).

4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.

5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.

6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb

7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.

If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.

However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.

But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:

Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.

Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3d6a18d 08-Feb-2019 Bart Van Assche <bvanassche@acm.org>

aio: Fix locking in aio_poll()

wake_up_locked() may but does not have to be called with interrupts
disabled. Since the fuse filesystem calls wake_up_locked() without
disabling interrupts aio_poll_wake() may be called with interrupts
enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
all code that acquires that lock from thread context must disable
interrupts. Hence change the spin_trylock() call in aio_poll_wake()
into a spin_trylock_irqsave() call. This patch fixes the following
lockdep complaint:

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.0.0-rc4-next-20190131 #23 Not tainted
-----------------------------------------------------
syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

and this task is already holding:
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
which would create a new lock dependency:
(&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
(&(&ctx->ctx_lock)->rlock){..-.}

... which became SOFTIRQ-irq-safe at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
_raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
spin_lock_irq include/linux/spinlock.h:354 [inline]
free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2486 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
__do_softirq+0x266/0x95a kernel/softirq.c:292
run_ksoftirqd kernel/softirq.c:654 [inline]
run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
kthread+0x357/0x430 kernel/kthread.c:247
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

to a SOFTIRQ-irq-unsafe lock:
(&fiq->waitq){+.+.}

... which became SOFTIRQ-irq-unsafe at:
...
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
fuse_send_init fs/fuse/inode.c:989 [inline]
fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
mount_nodev+0x68/0x110 fs/super.c:1392
fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
vfs_get_tree+0x123/0x450 fs/super.c:1481
do_new_mount fs/namespace.c:2610 [inline]
do_mount+0x1436/0x2c40 fs/namespace.c:2932
ksys_mount+0xdb/0x150 fs/namespace.c:3148
__do_sys_mount fs/namespace.c:3162 [inline]
__se_sys_mount fs/namespace.c:3159 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&fiq->waitq);
local_irq_disable();
lock(&(&ctx->ctx_lock)->rlock);
lock(&fiq->waitq);
<Interrupt>
lock(&(&ctx->ctx_lock)->rlock);

*** DEADLOCK ***

1 lock held by syz-executor2/13779:
#0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
#0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
#0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
#0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&(&ctx->ctx_lock)->rlock){..-.} {
IN-SOFTIRQ-W at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
_raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
spin_lock_irq include/linux/spinlock.h:354 [inline]
free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2486 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
__do_softirq+0x266/0x95a kernel/softirq.c:292
run_ksoftirqd kernel/softirq.c:654 [inline]
run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
kthread+0x357/0x430 kernel/kthread.c:247
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
INITIAL USE at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
_raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
spin_lock_irq include/linux/spinlock.h:354 [inline]
__do_sys_io_cancel fs/aio.c:2052 [inline]
__se_sys_io_cancel fs/aio.c:2035 [inline]
__x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
}
... key at: [<ffffffff8a574140>] __key.52370+0x0/0x40
... acquired at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
aio_poll fs/aio.c:1772 [inline]
__io_submit_one fs/aio.c:1875 [inline]
io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
__do_sys_io_submit fs/aio.c:1953 [inline]
__se_sys_io_submit fs/aio.c:1923 [inline]
__x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

the dependencies between the lock to be acquired
and SOFTIRQ-irq-unsafe lock:
-> (&fiq->waitq){+.+.} {
HARDIRQ-ON-W at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
fuse_send_init fs/fuse/inode.c:989 [inline]
fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
mount_nodev+0x68/0x110 fs/super.c:1392
fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
vfs_get_tree+0x123/0x450 fs/super.c:1481
do_new_mount fs/namespace.c:2610 [inline]
do_mount+0x1436/0x2c40 fs/namespace.c:2932
ksys_mount+0xdb/0x150 fs/namespace.c:3148
__do_sys_mount fs/namespace.c:3162 [inline]
__se_sys_mount fs/namespace.c:3159 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
SOFTIRQ-ON-W at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
fuse_send_init fs/fuse/inode.c:989 [inline]
fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
mount_nodev+0x68/0x110 fs/super.c:1392
fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
vfs_get_tree+0x123/0x450 fs/super.c:1481
do_new_mount fs/namespace.c:2610 [inline]
do_mount+0x1436/0x2c40 fs/namespace.c:2932
ksys_mount+0xdb/0x150 fs/namespace.c:3148
__do_sys_mount fs/namespace.c:3162 [inline]
__se_sys_mount fs/namespace.c:3159 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
INITIAL USE at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
fuse_send_init fs/fuse/inode.c:989 [inline]
fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
mount_nodev+0x68/0x110 fs/super.c:1392
fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
vfs_get_tree+0x123/0x450 fs/super.c:1481
do_new_mount fs/namespace.c:2610 [inline]
do_mount+0x1436/0x2c40 fs/namespace.c:2932
ksys_mount+0xdb/0x150 fs/namespace.c:3148
__do_sys_mount fs/namespace.c:3162 [inline]
__se_sys_mount fs/namespace.c:3159 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
}
... key at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40
... acquired at:
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
aio_poll fs/aio.c:1772 [inline]
__io_submit_one fs/aio.c:1875 [inline]
io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
__do_sys_io_submit fs/aio.c:1953 [inline]
__se_sys_io_submit fs/aio.c:1923 [inline]
__x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

stack backtrace:
CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
check_irq_usage kernel/locking/lockdep.c:1650 [inline]
check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
check_prev_add kernel/locking/lockdep.c:1860 [inline]
check_prevs_add kernel/locking/lockdep.c:1968 [inline]
validate_chain kernel/locking/lockdep.c:2339 [inline]
__lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
spin_lock include/linux/spinlock.h:329 [inline]
aio_poll fs/aio.c:1772 [inline]
__io_submit_one fs/aio.c:1875 [inline]
io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
__do_sys_io_submit fs/aio.c:1953 [inline]
__se_sys_io_submit fs/aio.c:1923 [inline]
__x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
[ bvanassche: added a comment ]
Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8dabe724 06-Jan-2019 Arnd Bergmann <arnd@arndb.de>

y2038: syscalls: rename y2038 compat syscalls

A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# ec51f8ee 05-Feb-2019 Mike Marshall <hubcap@omnibond.com>

aio: initialize kiocb private in case any filesystems expect it.

A recent optimization had left private uninitialized.

Fixes: 2bc4ca9bb600 ("aio: don't zero entire aio_kiocb aio_get_req()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ab41ee68 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: drop unused argument of migrate_page_move_mapping()

All callers of migrate_page_move_mapping() now pass NULL for 'head'
argument. Drop it.

Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 875736bb 20-Nov-2018 Jens Axboe <axboe@kernel.dk>

aio: abstract out io_event filler helper

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 88a6f18b 24-Nov-2018 Jens Axboe <axboe@kernel.dk>

aio: split out iocb copy from io_submit_one()

In preparation of handing in iocbs in a different fashion as well. Also
make it clear that the iocb being passed in isn't modified, by marking
it const throughout.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 71ebc6fe 24-Nov-2018 Jens Axboe <axboe@kernel.dk>

aio: use iocb_put() instead of open coding it

Replace the percpu_ref_put() + kmem_cache_free() with a call to
iocb_put() instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a79d40e9 04-Dec-2018 Jens Axboe <axboe@kernel.dk>

aio: only use blk plugs for > 2 depth submissions

Plugging is meant to optimize submission of a string of IOs, if we don't
have more than 2 being submitted, don't bother setting up a plug.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2bc4ca9b 04-Dec-2018 Jens Axboe <axboe@kernel.dk>

aio: don't zero entire aio_kiocb aio_get_req()

It's 192 bytes, fairly substantial. Most items don't need to be cleared,
especially not upfront. Clear the ones we do need to clear, and leave
the other ones for setup when the iocb is prepared and submitted.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 432c7997 19-Nov-2018 Christoph Hellwig <hch@lst.de>

aio: separate out ring reservation from req allocation

This is in preparation for certain types of IO not needing a ring
reserveration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# bc9bff61 06-Nov-2018 Jens Axboe <axboe@kernel.dk>

aio: use assigned completion handler

We know this is a read/write request, but in preparation for
having different kinds of those, ensure that we call the assigned
handler instead of assuming it's aio_complete_rq().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0afa9964 10-Dec-2018 Jeff Moyer <jmoyer@redhat.com>

aio: fix spectre gadget in lookup_ioctx

Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a538e3ff 10-Dec-2018 Jeff Moyer <jmoyer@redhat.com>

aio: fix spectre gadget in lookup_ioctx

Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.

Cc: stable@vger.kernel.org
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 7a35397f 19-Sep-2018 Deepa Dinamani <deepa.kernel@gmail.com>

io_pgetevents: use __kernel_timespec

struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update io_pgetevents interfaces to use struct __kernel_timespec.

sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:

New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)

Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents
Compat : compat_sys_io_pgetevents_time64

Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)

Native 32 bit : sys_io_pgetevents_time32
Compat : compat_sys_io_pgetevents

Note that io_getevents syscalls do not have a y2038 safe solution.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 854a6ed5 19-Sep-2018 Deepa Dinamani <deepa.kernel@gmail.com>

signal: Add restore_user_sigmask()

Refactor the logic to restore the sigmask before the syscall
returns into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during
the execution and restored after the execution of the syscall.

With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# ded653cc 19-Sep-2018 Deepa Dinamani <deepa.kernel@gmail.com>

signal: Add set_user_sigmask()

Refactor reading sigset from userspace and updating sigmask
into an api.

This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during,
and restored after, the execution of the syscall.

With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll,
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.

Note that the calls to sigprocmask() ignored the return value
from the api as the function only returns an error on an invalid
first argument that is hardcoded at these call sites.
The updated logic uses set_current_blocked() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# 154989e4 22-Nov-2018 Christoph Hellwig <hch@lst.de>

aio: clear IOCB_HIPRI

No one is going to poll for aio (yet), so we must clear the HIPRI
flag, as we would otherwise send it down the poll queues, where no
one will be polling for completions.

Signed-off-by: Christoph Hellwig <hch@lst.de>

IOCB_HIPRI, not RWF_HIPRI.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 76dc8913 19-Nov-2018 Damien Le Moal <damien.lemoal@wdc.com>

aio: Fix fallback I/O priority value

For cases when the application does not specify aio_reqprio for an aio,
fallback to use get_current_ioprio() to obtain the task I/O priority
last set using ioprio_set() rather than the hardcoded IOPRIO_CLASS_NONE
value.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 53fffe29 17-Nov-2018 Jens Axboe <axboe@kernel.dk>

aio: fix failure to put the file pointer

If the ioprio capability check fails, we return without putting
the file pointer.

Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9afc5eee 12-Jul-2018 Arnd Bergmann <arnd@arndb.de>

y2038: globally rename compat_time to old_time32

Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# e8693bcf 15-Jul-2018 Christoph Hellwig <hch@lst.de>

aio: allow direct aio poll comletions for keyed wakeups

If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>


# bfe4037e 16-Jul-2018 Christoph Hellwig <hch@lst.de>

aio: implement IOCB_CMD_POLL

Simple one-shot poll through the io_submit() interface. To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL. It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>


# 9018ccc4 24-Jul-2018 Christoph Hellwig <hch@lst.de>

aio: add a iocb refcount

This is needed to prevent races caused by the way the ->poll API works.
To avoid introducing overhead for other users of the iocbs we initialize
it to zero and only do refcount operations if it is non-zero in the
completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>


# 9ba546c0 11-Jul-2018 Christoph Hellwig <hch@lst.de>

aio: don't expose __aio_sigset in uapi

glibc uses a different defintion of sigset_t than the kernel does,
and the current version would pull in both. To fix this just do not
expose the type at all - this somewhat mirrors pselect() where we
do not even have a type for the magic sigmask argument, but just
use pointer arithmetics.

Fixes: 7a074e96 ("aio: implement io_pgetevents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Adrian Reber <adrian@lisas.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d93aa9d8 09-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

new wrapper: alloc_file_pseudo()

takes inode, vfsmount, name, O_... flags and file_operations and
either returns a new struct file (in which case inode reference we
held is consumed) or returns ERR_PTR(), in which case no refcounts
are altered.

converted aio_private_file() and sock_alloc_file() to it

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c9c554f2 11-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

alloc_file(): switch to passing O_... flags instead of FMODE_... mode

... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a11e1d43 28-Jun-2018 Linus Torvalds <torvalds@linux-foundation.org>

Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL

The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2739b807 11-Jun-2018 Christoph Hellwig <hch@lst.de>

aio: only return events requested in poll_mask() for IOCB_CMD_POLL

The ->poll_mask() operation has a mask of events that the caller
is interested in, but not all implementations might take it into
account. Mask the return value to only the requested events,
similar to what the poll and epoll code does.

Reported-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9a6d9a62 04-Jun-2018 Adam Manzanares <adam.manzanares@wdc.com>

fs: aio ioprio use ioprio_check_cap ret val

Previously the value was ignored.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d9a08a9e 22-May-2018 Adam Manzanares <adam.manzanares@wdc.com>

fs: Add aio iopriority support

This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.

This patch depends on block: add ioprio_check_cap function.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fc28724d 22-May-2018 Adam Manzanares <adam.manzanares@wdc.com>

fs: Convert kiocb rw_hint from enum to u16

In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1da92779 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: sanitize the limit checking in io_submit(2)

as it is, the logics in native io_submit(2) is "if asked for
more than LONG_MAX/sizeof(pointer) iocbs to submit, don't
bother with more than LONG_MAX/sizeof(pointer)" (i.e.
512M requests on 32bit and 1E requests on 64bit) while
compat io_submit(2) goes with "stop after the first
PAGE_SIZE/sizeof(pointer) iocbs", i.e. 1K or so. Which is
* inconsistent
* *way* too much in native case
* possibly too little in compat one
and
* wrong anyway, since the natural point where we
ought to stop bothering is ctx->nr_events

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 67ba049f 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: fold do_io_submit() into callers

get rid of insane "copy array of 32bit pointers into an array of
native ones" glue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 95af8496 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: shift copyin of iocb into io_submit_one()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d2988bd4 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio_read_events_ring(): make a bit more readable

The logics for 'avail' is
* not past the tail of cyclic buffer
* no more than asked
* not past the end of buffer
* not past the end of a page

Unobfuscate the last part.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9061d14a 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way

... so just make them return 0 when caller does not need to destroy iocb

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3c96c7f4 28-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: take list removal to (some) callers of aio_complete()

We really want iocb out of io_cancel(2) reach before we start tearing
it down.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ac060cba 27-May-2018 Christoph Hellwig <hch@lst.de>

aio: add missing break for the IOCB_CMD_FDSYNC case

Looks like this got lost in a merge.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1962da0d 20-May-2018 Christoph Hellwig <hch@lst.de>

aio: try to complete poll iocbs without context switch

If we can acquire ctx_lock without spinning we can just remove our
iocb from the active_reqs list, and thus complete the iocbs from the
wakeup context.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 2c14fa83 20-Mar-2018 Christoph Hellwig <hch@lst.de>

aio: implement IOCB_CMD_POLL

Simple one-shot poll through the io_submit() interface. To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL. It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 888933f8 23-May-2018 Christoph Hellwig <hch@lst.de>

aio: simplify cancellation

With the current aio code there is no need for the magic KIOCB_CANCELLED
value, as a cancelation just kicks the driver to queue the completion
ASAP, with all actual completion handling done in another thread. Given
that both the completion path and cancelation take the context lock there
is no need for magic cmpxchg loops either. If we remove iocbs from the
active list after calling ->ki_cancel (but with ctx_lock still held), we
can also rely on the invariant thay anything found on the list has a
->ki_cancel callback and can be cancelled, further simplifing the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# f3a2752a 30-Mar-2018 Christoph Hellwig <hch@lst.de>

aio: simplify KIOCB_KEY handling

No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
as we can do that right after retrieving it from userspace. Also move the
KIOCB_KEY definition to aio.c as it is an internal value not used by any
other place in the kernel.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 4faa9996 23-May-2018 Al Viro <viro@zeniv.linux.org.uk>

fix io_destroy()/aio_complete() race

If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().

Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).

Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# baf10564 20-May-2018 Al Viro <viro@zeniv.linux.org.uk>

aio: fix io_destroy(2) vs. lookup_ioctx() race

kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx(). As the result, we could get ctx freed right under
lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.

Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs(). That does
INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.

In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().

Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get(). Sure, we'd increment the counter before
ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.

The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.

Cc: stable@kernel.org
Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7a074e96 02-May-2018 Christoph Hellwig <hch@lst.de>

aio: implement io_pgetevents

This is the io_getevents equivalent of ppoll/pselect and allows to
properly mix signals and aio completions (especially with IOCB_CMD_POLL)
and atomically executes the following sequence:

sigset_t origmask;

pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);

Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure. It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# a3c0d439 27-Mar-2018 Christoph Hellwig <hch@lst.de>

aio: implement IOCB_CMD_FSYNC and IOCB_CMD_FDSYNC

Simple workqueue offload for now, but prepared for adding a real aio_fsync
method if the need arises. Based on an earlier patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 54843f87 02-May-2018 Christoph Hellwig <hch@lst.de>

aio: refactor read/write iocb setup

Don't reference the kiocb structure from the common aio code, and move
any use of it into helper specific to the read/write path. This is in
preparation for aio_poll support that wants to use the space for different
fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 92ce4728 06-Apr-2018 Christoph Hellwig <hch@lst.de>

aio: remove the extra get_file/fput pair in io_submit_one

If we release the lockdep write protection token before calling into
->write_iter and thus never access the file pointer after an -EIOCBQUEUED
return from ->write_iter or ->read_iter we don't need this extra
reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 75321b50 09-Apr-2018 Christoph Hellwig <hch@lst.de>

aio: sanitize ki_list handling

Instead of handcoded non-null checks always initialize ki_list to an
empty list and use list_empty / list_empty_careful on it. While we're
at it also error out on a double call to kiocb_set_cancel_fn instead
of ignoring it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# c213dc82 23-Nov-2017 Christoph Hellwig <hch@lst.de>

aio: remove an outdated BUG_ON and comment in aio_complete

These days we don't treat sync iocbs special in the aio completion code as
they never use it. Remove the old comment and BUG_ON given that the
current definition of is_sync_kiocb makes it impossible to hit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 01a658e1 14-Dec-2017 Christoph Hellwig <hch@lst.de>

aio: don't print the page size at boot time

The page size is in no way related to the aio code, and printing it in
the (debug) dmesg at every boot serves no purpose.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# f729863a 14-Mar-2018 Tejun Heo <tj@kernel.org>

fs/aio: Use rcu_work instead of explicit rcu and work item

Workqueue now has rcu_work. Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>


# d0264c01 14-Mar-2018 Tejun Heo <tj@kernel.org>

fs/aio: Use RCU accessors for kioctx_table->table[]

While converting ioctx index from a list to a table, db446a08c23d
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors. This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the approriate RCU
accessors when using the field.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.12+


# a6d7cff4 14-Mar-2018 Tejun Heo <tj@kernel.org>

fs/aio: Add explicit RCU grace period when freeing kioctx

While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
incorrectly removed explicit RCU grace period before freeing kioctx.
The intention seems to be depending on the internal RCU grace periods
of percpu_ref; however, percpu_ref uses a different flavor of RCU,
sched-RCU. This can lead to kioctx being freed while RCU read
protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.13+


# 6aa7de05 23-Oct-2017 Mark Rutland <mark.rutland@arm.com>

locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fa2e62a5 04-Aug-2017 Deepa Dinamani <deepa.kernel@gmail.com>

io_getevents: Use timespec64 to represent timeouts

struct timespec is not y2038 safe. Use y2038 safe
struct timespec64 to represent timeouts.
The system call interface itself will be changed as
part of different series.

Timeouts will not really need more than 32 bits.
But, replacing these with timespec64 helps verification
of a y2038 safe kernel by getting rid of timespec
internally.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: linux-aio@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2916ecc0 08-Sep-2017 Jérôme Glisse <jglisse@redhat.com>

mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY

Introduce a new migration mode that allow to offload the copy to a device
DMA engine. This changes the workflow of migration and not all
address_space migratepage callback can support this.

This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).

No additional per-filesystem migratepage testing is needed. I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch). The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.

Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations. But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.

Usual flow is:

For each page {
1 - lock page
2 - call migratepage() callback
3 - (extra locking in some migratepage() callback)
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
5 - copy page
6 - (unlock any extra lock of migratepage() callback)
7 - return from migratepage() callback
8 - unlock page
}

The new mode MIGRATE_SYNC_NO_COPY:
1 - lock multiple pages
For each page {
2 - call migratepage() callback
3 - abort in all problematic migratepage() callback
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
} // finished all calls to migratepage() callback
5 - DMA copy multiple pages
6 - unlock all the pages

To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.

Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.

Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a8a9867 05-Jul-2017 Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>

fs: aio: fix the increment of aio-nr and counting against aio-max-nr

Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':

ioctx_alloc()
...
nr_events = max(nr_events, num_possible_cpus() * 4);
nr_events *= 2;
...
ctx->max_reqs = nr_events;
...
aio_nr += ctx->max_reqs;
....

This limits the number of aio contexts actually available to much less
than aio-max-nr, and is increasingly worse with greater number of CPUs.

For example, with 64 CPUs, only 256 aio contexts are actually available
(with aio-max-nr = 65536) because the increment is 512 in that scenario.

Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
is 128, but make it 256 (double) as counting against 'aio-max-nr * 2':

ioctx_alloc()
...
if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
...
goto err_ctx;
...

This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Paul Nguyen <nguyenp@us.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 91f9943e 29-Aug-2017 Christoph Hellwig <hch@lst.de>

fs: support RWF_NOWAIT for buffered reads

This is based on the old idea and code from Milosz Tanski. With the aio
nowait code it becomes mostly trivial now. Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 45d06cf7 27-Jun-2017 Jens Axboe <axboe@kernel.dk>

fs: add O_DIRECT and aio support for sending down write life time hints

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b745fafa 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT

RWF_NOWAIT informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable
of returning -EAGAIN if the AIO call will block. This must be set by
supporting filesystems in the ->open() call.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9830f4be 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: Use RWF_* flags for AIO operations

aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will
carry the RWF_* flags. We cannot use aio_flags because they are not
checked for validity which may break existing applications.

Note, the only place RWF_HIPRI comes in effect is dio_await_one().
All the rest of the locations, aio code return -EIOCBQUEUED before the
checks for RWF_HIPRI.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 897ab3e0 24-Feb-2017 Mike Rapoport <rppt@linux.vnet.ibm.com>

userfaultfd: non-cooperative: add event for memory unmaps

When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.

Since there might be different uffd contexts for the affected VMAs, we
first should create a temporary representation for the unmap event for
each uffd context and then notify them one by one to the appropriate
userfault file descriptors.

The event notification occurs after the mmap_sem has been released.

[arnd@arndb.de: fix nommu build]
Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
[mhocko@suse.com: fix nommu build]
Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb7462b6 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helpers for calling f_op->{read,write}_iter()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a12f1ae6 13-Dec-2016 Shaohua Li <shli@fb.com>

aio: fix lock dep warning

lockdep reports a warnning. file_start_write/file_end_write only
acquire/release the lock for regular files. So checking the files in aio
side too.

[ 453.532141] ------------[ cut here ]------------
[ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
[ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
[ 453.533011] Modules linked in:
[ 453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
[ 453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 453.533011] ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
[ 453.533011] ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
[ 453.533011] ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
[ 453.533011] Call Trace:
[ 453.533011] [<ffffffff8196cffb>] dump_stack+0x67/0x9c
[ 453.533011] [<ffffffff81091ee1>] __warn+0x111/0x130
[ 453.533011] [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
[ 453.533011] [<ffffffff81091f00>] ? __warn+0x130/0x130
[ 453.533011] [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
[ 453.533011] [<ffffffff811205d4>] lock_release+0x434/0x670
[ 453.533011] [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
[ 453.533011] [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
[ 453.533011] [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
[ 453.533011] [<ffffffff813aa6c9>] aio_write+0x229/0x280
[ 453.533011] [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
[ 453.533011] [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
[ 453.533011] [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
[ 453.533011] [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
[ 453.533011] [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
[ 453.533011] [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
[ 453.533011] [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
[ 453.533011] [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
[ 453.533011] [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
[ 453.533011] [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 453.533011] [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
[ 453.533011] [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
[ 453.533011] [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
[ 453.533011] ---[ end trace b2fbe664d1cc0082 ]---

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2456e855 25-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

ktime: Get rid of the union

ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c00d2c7e 20-Dec-2016 Al Viro <viro@zeniv.linux.org.uk>

move aio compat to fs/aio.c

... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 45063097 04-Dec-2016 Al Viro <viro@zeniv.linux.org.uk>

don't open-code file_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 70fe2f48 30-Oct-2016 Jan Kara <jack@suse.cz>

aio: fix freeze protection of aio writes

Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
--runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: forward ported on top of various VFS and aio changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 89319d31 30-Oct-2016 Christoph Hellwig <hch@lst.de>

fs: remove aio_run_iocb

Pass the ABI iocb structure to aio_setup_rw and let it handle the
non-vectored I/O case as well. With that and a new helper for the AIO
return value handling we can now define new aio_read and aio_write
helpers that implement reads and writes in a self-contained way without
duplicating too much code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 723c0384 30-Oct-2016 Christoph Hellwig <hch@lst.de>

fs: remove the never implemented aio_fsync file operation

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0b944d3a 30-Oct-2016 Christoph Hellwig <hch@lst.de>

aio: hold an extra file reference over AIO read/write operations

Otherwise we might dereference an already freed file and/or inode
when aio_complete is called before we return from the read_iter or
write_iter method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# de04e769 14-Sep-2016 Rasmus Villemoes <linux@rasmusvillemoes.dk>

fs/aio.c: eliminate redundant loads in put_aio_ring_file

Using a local variable we can prevent gcc from reloading
aio_ring_file->f_inode->i_mapping twice, eliminating 2x2 dependent
loads.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 22f6b4d3 15-Sep-2016 Jann Horn <jann@thejh.net>

aio: mark AIO pseudo-fs noexec

This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set. Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.

I have tested the patch on my machine.

To test the behavior, compile and run this:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/personality.h>
#include <linux/aio_abi.h>
#include <err.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>

int main(void) {
personality(READ_IMPLIES_EXEC);
aio_context_t ctx = 0;
if (syscall(__NR_io_setup, 1, &ctx))
err(1, "io_setup");

char cmd[1000];
sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
(int)getpid());
system(cmd);
return 0;
}

In the output, "rw-s" is good, "rwxs" is bad.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 013373e8 23-May-2016 Michal Hocko <mhocko@suse.com>

aio: make aio_setup_ring killable

aio_setup_ring waits for mmap_sem in writable mode. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting. This will also expedite the
return to the userspace and do_exit.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Benamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2958ec17 31-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

aio: remove a pointless assignment

the value is never used after that point

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5477e70a 04-Sep-2015 Oleg Nesterov <oleg@redhat.com>

mm: move ->mremap() from file_operations to vm_operations_struct

vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
way ->mremap() can have more users. Say, vdso.

While at it, s/aio_ring_remap/aio_ring_mremap/.

Note: this is the minimal change before ->mremap() finds another user in
file_operations; this method should have more arguments, and it can be
used to kill arch_remap().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc48e56d 15-Apr-2015 Jens Axboe <axboe@fb.com>

aio: fix serial draining in exit_aio()

exit_aio() currently serializes killing io contexts. Each context
killing ends up having to do percpu_ref_kill(), which in turns has
to wait for an RCU grace period. This can take a long time, depending
on the number of contexts. And there's no point in doing them serially,
when we could be waiting for all of them in one fell swoop.

This patches makes my fio thread offload test case exit 0.2s instead
of almost 6s.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 2ba48ce5 09-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

mirror O_APPEND and O_DIRECT into iocb->ki_flags

... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 84363182 03-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

->aio_read and ->aio_write removed

no remaining users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 47e39362 31-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

aio_run_iocb(): kill dead check

We check if ->ki_pos is positive. However, by that point we have
already done rw_verify_area(), which would have rejected such
unless the file had been one of /dev/mem, /dev/kmem and /proc/kcore.
All of which do not have vectored rw methods, so we would've bailed
out even earlier.

This check had been introduced before rw_verify_area() had been added there
- in fact, it was a subset of checks done on sync paths by rw_verify_area()
(back then the /dev/mem exception didn't exist at all). The rest of checks
(mandatory locking, etc.) hadn't been added until later. Unfortunately,
by the time the call of rw_verify_area() got added, the /dev/mem exception
had already appeared, so it wasn't obvious that the older explicit check
downstream had become dead code. It *is* a dead code, though, since the few
files for which the exception applies do not have ->aio_{read,write}() or
->{read,write}_iter() and for them we won't reach that check anyway.

What's more, even if we ever introduce vectored methods for /dev/mem
and friends, they'll have to cope with negative positions anyway, since
readv(2) and writev(2) are using the same checks as read(2) and write(2) -
i.e. rw_verify_area().

Let's bury it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 08397acd 31-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

ioctx_alloc(): remove pointless check

Way, way back kiocb used to be picked from arrays, so ioctx_alloc()
checked for multiplication overflow when calculating the size of
such array. By the time fs/aio.c went into the tree (in 2002) they
were already allocated one-by-one by kmem_cache_alloc(), so that
check had already become pointless. Let's bury it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 32a56afa 21-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

aio_setup_vectored_rw(): switch to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d4fb392f 21-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

kill aio_setup_single_vector()

identical to import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a96114fa 20-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

aio: simplify arguments of aio_setup_..._rw()

We don't need req in either of those. We don't need nr_segs in caller.
We don't really need len in caller either - iov_iter_count(&iter) will do.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4c185ce0 20-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

aio: lift iov_iter_init() into aio_setup_..._rw()

the only non-trivial detail is that we do it before rw_verify_area(),
so we'd better cap the length ourselves in aio_setup_single_rw()
case (for vectored case rw_copy_check_uvector() will do that for us).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# deeb8525 06-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

ioctx_alloc(): fix vma (and file) leak on failure

If we fail past the aio_setup_ring(), we need to destroy the
mapping. We don't need to care about anybody having found ctx,
or added requests to it, since the last failure exit is exactly
the failure to make ctx visible to lookups.

Reproducer (based on one by Joe Mario <jmario@redhat.com>):

void count(char *p)
{
char s[80];
printf("%s: ", p);
fflush(stdout);
sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
system(s);
}

int main()
{
io_context_t *ctx;
int created, limit, i, destroyed;
FILE *f;

count("before");
if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
perror("opening aio-max-nr");
else if (fscanf(f, "%d", &limit) != 1)
fprintf(stderr, "can't parse aio-max-nr\n");
else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
perror("allocating aio_context_t array");
else {
for (i = 0, created = 0; i < limit; i++) {
if (io_setup(1000, ctx + created) == 0)
created++;
}
for (i = 0, destroyed = 0; i < created; i++)
if (io_destroy(ctx[i]) == 0)
destroyed++;
printf("created %d, failed %d, destroyed %d\n",
created, limit - created, destroyed);
count("after");
}
}

Found-by: Joe Mario <jmario@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b2edffdd 06-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

fix mremap() vs. ioctx_kill() race

teach ->mremap() method to return an error and have it fail for
aio mappings in process of being killed

Note that in case of ->mremap() failure we need to undo move_page_tables()
we'd already done; we could call ->mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ ->mremap()
has done, which would be a lot more headache in general.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 04b2fa9f 02-Feb-2015 Christoph Hellwig <hch@lst.de>

fs: split generic and aio kiocb

Most callers in the kernel want to perform synchronous file I/O, but
still have to bloat the stack with a full struct kiocb. Split out
the parts needed in filesystem code from those in the aio code, and
only allocate those needed to pass down argument on the stack. The
aio code embedds the generic iocb in the one it allocates and can
easily get back to it by using container_of.

Also add a ->ki_complete method to struct kiocb, this is used to call
into the aio code and thus removes the dependency on aio for filesystems
impementing asynchronous operations. It will also allow other callers
to substitute their own completion callback.

We also add a new ->ki_flags field to work around the nasty layering
violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
eventfd notification about ffs events").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 599bd19b 11-Feb-2015 Christoph Hellwig <hch@lst.de>

fs: don't allow to complete sync iocbs through aio_complete

The AIO interface is fairly complex because it tries to allow
filesystems to always work async and then wakeup a synchronous
caller through aio_complete. It turns out that basically no one
was doing this to avoid the complexity and context switches,
and we've already fixed up the remaining users and can now
get rid of this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 66ee59af 11-Feb-2015 Christoph Hellwig <hch@lst.de>

fs: remove ki_nbytes

There is no need to pass the total request length in the kiocb, as
we already get passed in through the iov_iter argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# acd88d4e 04-Feb-2015 Kinglong Mee <kinglongmee@gmail.com>

fs/aio.c: Remove duplicate function name in pr_debug messages

Have defined pr_fmt as below in fs/aio.c, so remove duplicate
function name in pr_debug message.

#define pr_fmt(fmt) "%s: " fmt, __func__

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9c9ce763 03-Feb-2015 Dave Chinner <dchinner@redhat.com>

aio: annotate aio_read_event_ring for sleep patterns

Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
warnings like the following due to being called from wait_event
context:

WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110
Modules linked in:
CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
Call Trace:
[<ffffffff81daf2bd>] dump_stack+0x4c/0x65
[<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0
[<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50
[<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
[<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
[<ffffffff810bdfcf>] __might_sleep+0x7f/0x90
[<ffffffff81db8344>] mutex_lock+0x24/0x45
[<ffffffff81216b7c>] aio_read_events+0x4c/0x290
[<ffffffff81216fac>] read_events+0x1ec/0x220
[<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110
[<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50
[<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0
[<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17
---[ end trace bde69eaf655a4fea ]---

There is not actually a bug here, so annotate the code to tell the
debug logic that everything is just fine and not to fire a false
positive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# b83ae6d4 14-Jan-2015 Christoph Hellwig <hch@lst.de>

fs: remove mapping->backing_dev_info

Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b4caecd4 14-Jan-2015 Christoph Hellwig <hch@lst.de>

fs: introduce f_op->mmap_capabilities for nommu mmap support

Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead. Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device. It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 5f785de5 06-Nov-2014 Fam Zheng <famz@redhat.com>

aio: Skip timer for io_getevents if timeout=0

In this case, it is basically a polling. Let's not involve timer at all
because that would hurt performance for application event loops.

In an arbitrary test I've done, io_getevents syscall elapsed time
reduces from 50000+ nanoseconds to a few hundereds.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# e4a0d3e7 18-Sep-2014 Pavel Emelyanov <xemul@parallels.com>

aio: Make it possible to remap aio ring

There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.

What do you think?

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 835f252c 06-Nov-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that when shutting down mysqld (with AIO support,
on a ext3 formatted Harddrive) leads to a negative number of dirty pages
(underrun to the counter). The negative number results in a drastic reduction
of the write performance because the page cache is not used, because the kernel
thinks it is still 2 ^ 32 dirty pages open.

Add a warn trace in __dec_zone_state will catch this easily:

static inline void __dec_zone_state(struct zone *zone, enum
zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
+ WARN_ON_ONCE(item == NR_FILE_DIRTY &&
atomic_long_read(&zone->vm_stat[item]) < 0);
atomic_long_dec(&vm_stat[item]);
}

[ 21.341632] ------------[ cut here ]------------
[ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[ 21.355296] Modules linked in: wutbox_cp sata_mv
[ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[ 21.366793] Workqueue: events free_ioctx
[ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
(show_stack+0x20/0x24)
[ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
(dump_stack+0x24/0x28)
[ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
(warn_slowpath_common+0x84/0x9c)
[ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
(warn_slowpath_null+0x2c/0x34)
[ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
(cancel_dirty_page+0x164/0x224)
[ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
(truncate_inode_page+0x8c/0x158)
[ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
(truncate_inode_pages_range+0x11c/0x53c)
[ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
[<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
(truncate_setsize+0x5c/0x74)
[ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
(put_aio_ring_file.isra.14+0x34/0x90)
[ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
[<c013b424>] (aio_free_ring+0x20/0xcc)
[ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
(free_ioctx+0x24/0x44)
[ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
(process_one_work+0x134/0x47c)
[ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
(worker_thread+0x130/0x414)
[ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
(kthread+0xd4/0xec)
[ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
(ret_from_fork+0x14/0x20)
[ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
(bypasses the VFS dirty pages increment) when init, and aio fs uses
*default_backing_dev_info* as the backing dev, which does not disable
the dirty pages accounting capability.
So truncating aio ring file will contribute to accounting dirty pages (VFS
dirty pages decrement), then error occurs.

The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty. But thinking more, we have
already pinned pages via elevating the page's refcount, which can already
achieve the goal, so the SetPageDirty seems unnecessary.

In order to fix the issue, using the __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, the dirty pages accounting can work well. But as we
known, aio fs is an anonymous one, which should never cause any real write-back,
we can ignore the dirty pages (write back) accounting by disabling the dirty
pages (write back) accounting capability. So we introduce an aio private
backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
replace the default one.

Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 2aad2a86 24-Sep-2014 Tejun Heo <tj@kernel.org>

percpu_ref: add PERCPU_REF_INIT_* flags

With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>


# a34375ef 07-Sep-2014 Tejun Heo <tj@kernel.org>

percpu-refcount: add @gfp to percpu_ref_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>


# 6098b45b 03-Sep-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: block exit_aio() until all context requests are completed

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org


# 2ff396be 02-Sep-2014 Jeff Moyer <jmoyer@redhat.com>

aio: add missing smp_rmb() in read_events_ring

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org


# d856f32a 24-Aug-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: fix reqs_available handling

As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer. Reverting that commit would reintroduce a
regression when user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression. Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring. So long as there are as many events in the ring buffer as
there have been completion events generate, we cannot call
put_reqs_available(). The code to check for this is now placed in
refill_reqs_available().

A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Dan Aloni <dan@kernelim.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845ac was backported to
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 00fefb9c 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: use iovec array rather than the single one

Previously, we only offer a single iovec to handle all the read/write cases, so
the PREADV/PWRITEV request always need to alloc more iovec buffer when copying
user vectors.
If we use a tmp iovec array rather than the single one, some small PREADV/PWRITEV
workloads(vector size small than the tmp buffer) will not need to alloc more
iovec buffer when copying user vectors.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 2be4e7de 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: fix some comments

The function comments of aio_run_iocb and aio_read_events are out of date, so
fix them here.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 8dc4379e 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: use the macro rather than the inline magic number

Replace the inline magic number with the ready-made macro(AIO_RING_MAGIC),
just clean up.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# b53f1f82 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: remove the needless registration of ring file's private_data

Remove the registration of ring file's private_data, we do not use
it.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# be6fb451 22-Jul-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: remove no longer needed preempt_disable()

Based on feedback from Jens Axboe on 263782c1c95bbddbb022dc092fd89a36bb8d5577,
clean up get/put_reqs_available() to remove the no longer needed preempt_disable()
and preempt_enable() pair.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>


# 263782c1 13-Jul-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: protect reqs_available updates from changes in interrupt handlers

As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context. While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU. This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org


# 9a1049da 28-Jun-2014 Tejun Heo <tj@kernel.org>

percpu-refcount: require percpu_ref to be exited explicitly

Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without
going through re-allocation.

* In the similar vein, it's impossible to initialize a percpu_ref
count with static percpu variables.

* We need and have an explicit destructor anyway for failure paths -
percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit(). percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>


# 55c6c814 28-Jun-2014 Tejun Heo <tj@kernel.org>

percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()

ioctx_alloc() reaches inside percpu_ref and directly frees
->pcpu_count in its failure path, which is quite gross. percpu_ref
has been providing a proper interface to do this,
percpu_ref_cancel_init(), for quite some time now. Let's use that
instead.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>


# 855ef0de 30-Apr-2014 Oleg Nesterov <oleg@redhat.com>

aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()

ioctx_add_table() is the writer, it does not need rcu_read_lock() to
protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just
add the confusion.

And it doesn't need rcu_dereference() by the same reason, it must see
any updates previously done under the same ->ioctx_lock. We could use
rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
the function is simple enough.

The same for kill_ioctx(), although it does not update the pointer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 4b70ac5f 30-Apr-2014 Oleg Nesterov <oleg@redhat.com>

aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

On 04/30, Benjamin LaHaise wrote:
>
> > - ctx->mmap_size = 0;
> > -
> > - kill_ioctx(mm, ctx, NULL);
> > + if (ctx) {
> > + ctx->mmap_size = 0;
> > + kill_ioctx(mm, ctx, NULL);
> > + }
>
> Rather than indenting and moving the two lines changing mmap_size and the
> kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces
> the number of lines changed and avoid excessive indentation.

OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
of course, I won't argue.

The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

-------------------------------------------------------------------------------
Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

1. We can read ->ioctx_table only once and we do not read rcu_read_lock()
or even rcu_dereference().

This mm has no users, nobody else can play with ->ioctx_table. Otherwise
the code is buggy anyway, if we need rcu_read_lock() in a loop because
->ioctx_table can be updated then kfree(table) is obviously wrong.

2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid
munmap(), but another reason is that we simply can't do vm_munmap() unless
current->mm == mm and this is not true in general, the caller is mmput().

3. We do not really need to nullify mm->ioctx_table before return, probably
the current code does this to catch the potential problems. But in this
case RCU_INIT_POINTER(NULL) looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# edfbbf38 24-Jun-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: fix kernel memory disclosure in io_getevents() introduced in v3.10

A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org


# f8567a38 24-Jun-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: fix aio request leak when events are reaped by userspace

The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>


# 293bc982 11-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

new methods: ->read_iter() and ->write_iter()

Beginning to introduce those. Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 754320d6 30-Apr-2014 Leon Yu <chianglungyu@gmail.com>

aio: fix potential leak in aio_run_iocb().

iovec should be reclaimed whenever caller of rw_copy_check_uvector() returns,
but it doesn't hold when failure happens right after aio_setup_vectored_rw().

Fix that in a such way to avoid hairy goto.

Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org


# fa88b6f8 28-Apr-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: cleanup: flatten kill_ioctx()

There is no need to have most of the code in kill_ioctx() indented. Flatten
it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# fb2d4483 28-Apr-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: report error from io_destroy() when threads race in io_destroy()

As reported by Anatol Pomozov, io_destroy() fails to report an error when
it loses the race to destroy a given ioctx. Since there is a difference in
behaviour between the thread that wins the race (which blocks on outstanding
io requests) versus lthe thread that loses (which returns immediately), wire
up a return code from kill_ioctx() to the io_destroy() syscall.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>


# d52a8f9e 21-Apr-2014 Fabian Frederick <fabf@skynet.be>

fs/aio.c: Remove ctx parameter in kiocb_cancel

ctx is no longer used in kiocb_cancel since

57282d8fd74407 ("aio: Kill ki_users")

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# e02ba72a 15-Apr-2014 Anatol Pomozov <anatol.pomozov@gmail.com>

aio: block io_destroy() until all context requests are completed

deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().

man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.

Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:

initialize io_context
io_submit(read to buffer)
io_destroy()

// context is destroyed so we can free the resources
free(buffers);

// if the buffer is allocated by some other user he'll be surprised
// to learn that the buffer still filled by an outstanding operation
// from the destroyed io_context

The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.

If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.

Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
do not see the race condition anymore.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# fa8a53c3 28-Mar-2014 Benjamin LaHaise <bcrl@kvack.org>

aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration

As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues
exist in the aio ring page migration support.

As a result, for example, we have the following problem:

thread 1 | thread 2
|
aio_migratepage() |
|-> take ctx->completion_lock |
|-> migrate_page_copy(new, old) |
| *NOW*, ctx->ring_pages[idx] == old |
|
| *NOW*, ctx->ring_pages[idx] == old
| aio_read_events_ring()
| |-> ring = kmap_atomic(ctx->ring_pages[0])
| |-> ring->head = head; *HERE, write to the old ring page*
| |-> kunmap_atomic(ring);
|
|-> ctx->ring_pages[idx] = new |
| *BUT NOW*, the content of |
| ring_pages[idx] is old. |
|-> release ctx->completion_lock |

As above, the new ring page will not be updated.

Fix this issue, as well as prevent races in aio_ring_setup() by holding
the ring_lock mutex during kioctx setup and page migration. This avoids
the overhead of taking another spinlock in aio_read_events_ring() as Tang's
and Gu's original fix did, pushing the overhead into the migration code.

Note that to handle the nesting of ring_lock inside of mmap_sem, the
migratepage operation uses mutex_trylock(). Page migration is not a 100%
critical operation in this case, so the ocassional failure can be
tolerated. This issue was reported by Sasha Levin.

Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
Instead, make page migration fully serialised by mapping->private_lock, and
have aio_free_ring() simply disconnect the kioctx from the mapping by calling
put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the
error handling logic in aio_migratepage(), and should improve robustness.

v4: always do mutex_unlock() in cases when kioctx setup fails.

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable@vger.kernel.org


# 3dc9acb6 19-Dec-2013 Linus Torvalds <torvalds@linux-foundation.org>

aio: clean up and fix aio_setup_ring page mapping

Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.

However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them. And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.

Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).

So clean it all up, making it much more straightforward. Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.

Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e321fef 21-Dec-2013 Benjamin LaHaise <bcrl@kvack.org>

aio/migratepages: make aio migrate pages sane

The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 1881686f 21-Dec-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix kioctx leak introduced by "aio: Fix a trinity splat"

e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference
counting to correct a bug trinity found. Unfortunately, the change lead
to kioctxes being leaked because there was no final reference count to
put. Add that reference count back in to fix things.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org


# d1b94327 04-Dec-2013 Gu Zheng <guz.fnst@cn.fujitsu.com>

aio: clean up aio ring in the fail path

Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# ddb8c45b 19-Nov-2013 Sasha Levin <sasha.levin@oracle.com>

aio: nullify aio->ring_pages after freeing it

After freeing ring_pages we leave it as is causing a dangling pointer. This
has already caused an issue so to help catching any issues in the future
NULL it out.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# d5580232 19-Nov-2013 Sasha Levin <sasha.levin@oracle.com>

aio: prevent double free in ioctx_alloc

ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
fails to do so it would call aio_free_ring() before returning, but
ioctx_alloc() would call aio_free_ring() again causing a double free of
the ring.

This is easily reproducible from userspace.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 7f62656b 13-Nov-2013 Dan Carpenter <dan.carpenter@oracle.com>

aio: checking for NULL instead of IS_ERR

alloc_anon_inode() returns an ERR_PTR(), it doesn't return NULL.

Fixes: 71ad7490c1f3 ('rework aio migrate pages to use aio fs')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 71ad7490 17-Sep-2013 Benjamin LaHaise <bcrl@kvack.org>

rework aio migrate pages to use aio fs

Don't abuse anon_inodes.c to host private files needed by aio;
we can bloody well declare a mini-fs of our own instead of
patching up what anon_inodes can create for us.

Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e34ecee2 10-Oct-2013 Kent Overstreet <kmo@daterainc.com>

aio: Fix a trinity splat

aio kiocb refcounting was broken - it was relying on keeping track of
the number of available ring buffer entries, which it needs to do
anyways; then at shutdown time it'd wait for completions to be delivered
until the # of available ring buffer entries equalled what it was
initialized to.

Problem with that is that the ring buffer is mapped writable into
userspace, so userspace could futz with the head and tail pointers to
cause the kernel to see extra completions, and cause free_ioctx() to
return while there were still outstanding kiocbs. Which would be bad.

Fix is just to directly refcount the kiocbs - which is more
straightforward, and with the new percpu refcounting code doesn't cost
us any cacheline bouncing which was the whole point of the original
scheme.

Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't
subtracting from aio_nr if ioctx_add_table() failed.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 5e9ae2e5 26-Sep-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix use-after-free in aio_migratepage

Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a
use-after-free during teardown of the aio ring buffer's mapping. This turns
out to be caused by access to the ioctx's ring_pages via the migratepage
operation which was not being protected by any locks during ioctx freeing.
Use the address_space's private_lock to protect use and updates of the mapping's
private_data, and make ioctx teardown unlink the ioctx from the address space.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# d9b2c871 08-Sep-2013 Artem Savkov <artem.savkov@gmail.com>

aio: rcu_read_lock protection for new rcu_dereference calls

Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch"
(77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a
couple of new rcu_dereference calls which are not protected by rcu_read_lock
and result in following warnings during syscall fuzzing(trinity):

[ 471.646379] ===============================
[ 471.649727] [ INFO: suspicious RCU usage. ]
[ 471.653919] 3.11.0-next-20130906+ #496 Not tainted
[ 471.657792] -------------------------------
[ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
[ 471.665968]
[ 471.665968] other info that might help us debug this:
[ 471.665968]
[ 471.672141]
[ 471.672141] rcu_scheduler_active = 1, debug_locks = 1
[ 471.677549] 1 lock held by trinity-child0/3774:
[ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70
[ 471.688721]
[ 471.688721] stack backtrace:
[ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
[ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
[ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac
[ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4
[ 471.722902] Call Trace:
[ 471.724859] [<c18a814b>] dump_stack+0x4b/0x66
[ 471.728772] [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100
[ 471.733716] [<c119bc7c>] SyS_io_setup+0x89c/0xc70
[ 471.737806] [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70
[ 471.741689] [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0
[ 471.746080] [<c18b1fcc>] syscall_call+0x7/0xb
[ 471.749723] [<c1080000>] ? task_fork_fair+0x240/0x260

Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# d6c355c7 09-Sep-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix race in ring buffer page lookup introduced by page migration support

Prior to the introduction of page migration support in "fs/aio: Add support
to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3,
mapping of the ring buffer pages was done via get_user_pages() while
retaining mmap_sem held for write. This avoided possible races with userland
racing an munmap() or mremap(). The page migration patch, however, switched
to using mm_populate() to prime the page mapping. mm_populate() cannot be
called with mmap_sem held.

Instead of dropping the mmap_sem, revert to the old behaviour and simply
drop the use of mm_populate() since get_user_pages() will cause the pages to
get mapped anyways. Thanks to Al Viro for spotting this issue.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 77d30b14 30-Aug-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix rcu sparse warnings introduced by ioctx table lookup patch

Sseveral sparse warnings were caused by missing rcu_dereference() annotations
for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs
as the deref was protected by a spin lock in all instances.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>


# 79bd1bcf 30-Aug-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: remove unnecessary debugging from aio_free_ring()

The commit 36bc08cc0170 ("fs/aio: Add support to aio ring pages migration")
added some debugging code that is not required and resulted in a build error
when 98474236f72e ("vfs: make the dentry cache use the lockref infrastructure")
was added to the tree. The code is not required, so just delete it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# f30d704f 07-Aug-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: table lookup: verify ctx pointer

Another shortcoming of the table lookup patch was revealed where the pointer
was not being tested before being dereferenced. Verify this to avoid the
NULL pointer dereference.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# da90382c 05-Aug-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"

In the patch "aio: convert the ioctx list to table lookup v3", incorrect
handling in the ioctx_alloc() error path was introduced that lead to an
ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by
only calling ioctx_add_table() as the last step in ioctx_alloc().

Also, several unnecessary rcu_dereference() calls were added that lead to
RCU warnings where the system was already protected by a spin lock for
accessing mm->ioctx_table.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 6878ea72 31-Jul-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: be defensive to ensure request batching is non-zero instead of BUG_ON()

In the event that an overflow/underflow occurs while calculating req_batch,
clamp the minimum at 1 request instead of doing a BUG_ON().

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# db446a08 29-Jul-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: convert the ioctx list to table lookup v3

On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> > 32.51% [guest.kernel] [g] lookup_ioctx
> > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
> > 4.40% [guest.kernel] [g] lock_release
> > 4.19% [guest.kernel] [g] sched_clock_local
> > 3.86% [guest.kernel] [g] local_clock
> > 3.68% [guest.kernel] [g] native_sched_clock
> > 3.08% [guest.kernel] [g] sched_clock_cpu
> > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
> > 2.60% [guest.kernel] [g] memcpy
> > 2.33% [guest.kernel] [g] lock_acquired
> > 2.25% [guest.kernel] [g] lock_acquire
> > 1.84% [guest.kernel] [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores 1 2 4 8 16 32
> > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms
> > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43%
> > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms
> > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75%
> > % of radix
> > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list 58892 ms
> > %rsd 0.91%
> > radix 59404 ms
> > %rsd 0.81%
> > % of radix
> > relative 100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>

I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx. It's not finished yet, and
it needs to be tidied up, but is most of the way there.

-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo> -Kent

bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 4cd81c3d 29-Jul-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: double aio_max_nr in calculations

With the changes to use percpu counters for aio event ring size calculation,
existing increases to aio_max_nr are now insufficient to allow for the
allocation of enough events. Double the value used for aio_max_nr to account
for the doubling introduced by the percpu slack.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# d29c445b 18-Mar-2013 Kent Overstreet <koverstreet@google.com>

aio: Kill ki_dtor

sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 57282d8f 13-May-2013 Kent Overstreet <koverstreet@google.com>

aio: Kill ki_users

The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 8bc92afc 25-Feb-2013 Kent Overstreet <koverstreet@google.com>

aio: Kill unneeded kiocb members

The old aio retry infrastucture needed to save the various arguments to
to aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 73a7075e 09-May-2013 Kent Overstreet <koverstreet@google.com>

aio: Kill aio_rw_vect_retry()

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 5ffac122 09-May-2013 Kent Overstreet <koverstreet@google.com>

aio: Don't use ctx->tail unnecessarily

aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# bec68faa 13-May-2013 Kent Overstreet <koverstreet@google.com>

aio: io_cancel() no longer returns the io_event

Originally, io_event() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 723be6e3 28-May-2013 Kent Overstreet <koverstreet@google.com>

aio: percpu ioctx refcount

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# e1bdd5f2 25-Apr-2013 Kent Overstreet <koverstreet@google.com>

aio: percpu reqs_available

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before, it was being rounded up to the next
page before so this doesn't change the actual semantics.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 34e83fc6 25-Apr-2013 Kent Overstreet <koverstreet@google.com>

aio: reqs_active -> reqs_available

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 0c45355f 17-Jul-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix build when migration is disabled

When "fs/aio: Add support to aio ring pages migration" was applied, it
broke the build when CONFIG_MIGRATION was disabled. Wrap the migration
code with a test for CONFIG_MIGRATION to fix this and save a few bytes
when migration is disabled.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 36bc08cc 16-Jul-2013 Gu Zheng <guz.fnst@cn.fujitsu.com>

fs/aio: Add support to aio ring pages migration

As the aio job will pin the ring pages, that will lead to mem migrated
failed. In order to fix this problem we use an anon inode to manage the aio ring
pages, and setup the migratepage callback in the anon inode's address space, so
that when mem migrating the aio ring pages will be moved to other mem node safely.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>


# 4b30f07e 03-Jul-2013 Tang Chen <tangchen@cn.fujitsu.com>

aio: fix wrong comment in aio_complete()

ctx->ctx_lock should be ctx->completion_lock.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 68d70d03 19-Jun-2013 Al Viro <viro@zeniv.linux.org.uk>

constify rw_verify_area()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4fcc712f 12-Jun-2013 Kent Overstreet <koverstreet@google.com>

aio: fix io_destroy() regression by using call_rcu()

There was a regression introduced by 36f5588905c1 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.

This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).

Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).

So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03e04f04 24-May-2013 Benjamin LaHaise <bcrl@kvack.org>

aio: fix kioctx not being freed after cancellation at exit time

The recent changes overhauling fs/aio.c introduced a bug that results in
the kioctx not being freed when outstanding kiocbs are cancelled at
exit_aio() time. Specifically, a kiocb that is cancelled has its
completion events discarded by batch_complete_aio(), which then fails to
wake up the process stuck in free_ioctx(). Fix this by modifying the
wait_event() condition in free_ioctx() appropriately.

This patch was tested with the cancel operation in the thread based code
posted yesterday.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6900807c 24-May-2013 Jeff Moyer <jmoyer@redhat.com>

aio: fix io_getevents documentation

In reviewing man pages, I noticed that io_getevents is documented to
update the timeout that gets passed into the library call. This doesn't
happen in kernel space or in the library (even though it's documented to
do so in both places). Unless there is objection, I'd like to fix the
comments/docs to match the code (I will also update the man page upon
consensus).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41ef4eb8 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: kill ki_retry

Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
need this anymore - ki_retry was only called right after the kiocb was
initialized.

This also refactors and trims some duplicated code, as well as cleaning up
the refcounting/error handling a bit.

[akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
[akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8a660890 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: kill ki_key

ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e23bcae 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: give shared kioctx fields their own cachelines

[akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58c85dc2 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: kill struct aio_ring_info

struct aio_ring_info was kind of odd, the only place it's used is where
it's embedded in struct kioctx - there's no real need for it.

The next patch rearranges struct kioctx and puts various things on their
own cachelines - getting rid of struct aio_ring_info now makes that
reordering a bit clearer.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a1c8eae7 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: kill batch allocation

Previously, allocating a kiocb required touching quite a few global
(well, per kioctx) cachelines... so batching up allocation to amortize
those was worthwhile. But we've gotten rid of some of those, and in
another couple of patches kiocb allocation won't require writing to any
shared cachelines, so that means we can just rip this code out.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3e845ce0 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: change reqs_active to include unreaped completions

The aio code tries really hard to avoid having to deal with the
completion ringbuffer overflowing. To do that, it has to keep track of
the number of outstanding kiocbs, and the number of completions
currently in the ringbuffer - and it's got to check that every time we
allocate a kiocb. Ouch.

But - we can improve this quite a bit if we just change reqs_active to
mean "number of outstanding requests and unreaped completions" - that
means kiocb allocation doesn't have to look at the ringbuffer, which is
a fairly significant win.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0460fef2 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: use cancellation list lazily

Cancelling kiocbs requires adding them to a per kioctx linked list,
which is one of the few things we need to take the kioctx lock for in
the fast path. But most kiocbs can't be cancelled - so if we just do
this lazily, we can avoid quite a bit of locking overhead.

While we're at it, instead of using a flag bit switch to using ki_cancel
itself to indicate that a kiocb has been cancelled/completed. This lets
us get rid of ki_flags entirely.

[akpm@linux-foundation.org: remove buggy BUG()]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 21b40200 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: use flush_dcache_page()

This wasn't causing problems before because it's not needed on x86, but
it is needed on other architectures.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a31ad380 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: make aio_read_evt() more efficient, convert to hrtimers

Previously, aio_read_event() pulled a single completion off the
ringbuffer at a time, locking and unlocking each time. Change it to
pull off as many events as it can at a time, and copy them directly to
userspace.

This also fixes a bug where if copying the event to userspace failed,
we'd lose the event.

Also convert it to wait_event_interruptible_hrtimeout(), which
simplifies it quite a bit.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36f55889 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: refcounting cleanup

The usage of ctx->dead was fubar - it makes no sense to explicitly check
it all over the place, especially when we're already using RCU.

Now, ctx->dead only indicates whether we've dropped the initial
refcount. The new teardown sequence is:

set ctx->dead
hlist_del_rcu();
synchronize_rcu();

Now we know no system calls can take a new ref, and it's safe to drop
the initial ref:

put_ioctx();

We also need to ensure there are no more outstanding kiocbs. This was
done incorrectly - it was being done in kill_ctx(), and before dropping
the initial refcount. At this point, other syscalls may still be
submitting kiocbs!

Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
kioctx->users has dropped to 0 and we know no more iocbs could be
submitted.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11599eba 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: make aio_put_req() lockless

Freeing a kiocb needed to touch the kioctx for three things:

* Pull it off the reqs_active list
* Decrementing reqs_active
* Issuing a wakeup, if the kioctx was in the process of being freed.

This patch moves these to aio_complete(), for a couple reasons:

* aio_complete() already has to issue the wakeup, so if we drop the
kioctx refcount before aio_complete does its wakeup we don't have to
do it twice.
* aio_complete currently has to take the kioctx lock, so it makes sense
for it to pull the kiocb off the reqs_active list too.
* A later patch is going to change reqs_active to include unreaped
completions - this will mean allocating a kiocb doesn't have to look
at the ringbuffer. So taking the decrement of reqs_active out of
kiocb_free() is useful prep work for that patch.

This doesn't really affect cancellation, since existing (usb) code that
implements a cancel function still calls aio_complete() - we just have
to make sure that aio_complete does the necessary teardown for cancelled
kiocbs.

It does affect code paths where we free kiocbs that were never
submitted; they need to decrement reqs_active and pull the kiocb off the
reqs_active list. This occurs in two places: kiocb_batch_free(), which
is going away in a later patch, and the error path in io_submit_one.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1d98ebfc 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: do fget() after aio_get_req()

aio_get_req() will fail if we have the maximum number of requests
outstanding, which depending on the application may not be uncommon. So
avoid doing an unnecessary fget().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# caf4167a 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: dprintk() -> pr_debug()

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e179bca 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: move private stuff out of aio.h

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 906b973c 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: add kiocb_cancel()

Minor refactoring, to get rid of some duplicated code

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2d68449e 07-May-2013 Kent Overstreet <koverstreet@google.com>

aio: kill return value of aio_complete()

Nothing used the return value, and it probably wasn't possible to use it
safely for the locked versions (aio_complete(), aio_put_req()). Just
kill it.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41003a7b 07-May-2013 Zach Brown <zab@redhat.com>

aio: remove retry-based AIO

This removes the retry-based AIO infrastructure now that nothing in tree
is using it.

We want to remove retry-based AIO because it is fundemantally unsafe.
It retries IO submission from a kernel thread that has only assumed the
mm of the submitting task. All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task.
This design flaw means that nothing of any meaningful complexity can use
retry-based AIO.

This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking
around the unused run list in the submission path.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91d80a84 25-Apr-2013 Zhao Hongjiang <zhaohongjiang@huawei.com>

aio: fix possible invalid memory access when DEBUG is enabled

dprintk() shouldn't access @ring after it's unmapped.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d71db4f 19-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

lift sb_start_write/sb_end_write out of ->aio_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2cf09666 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect

... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b67bfe0d 27-Feb-2013 Sasha Levin <sasha.levin@oracle.com>

hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41badc15 22-Feb-2013 Michel Lespinasse <walken@google.com>

mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introduced the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bebeb3d6 22-Feb-2013 Michel Lespinasse <walken@google.com>

mm: introduce mm_populate() for populating new vmas

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3ffa3c0e 24-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: now fput() is OK from interrupt context; get rid of manual delayed __fput()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e3fc629d 30-May-2012 Al Viro <viro@zeniv.linux.org.uk>

switch aio and shm to do_mmap_pgoff(), make do_mmap() static

after all, 0 bytes and 0 pages is the same thing...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ac34ebb3 31-May-2012 Christopher Yeoh <cyeoh@au1.ibm.com>

aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()

A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
changes made to support CMA in an earlier patch.

Rather than having an additional check_access parameter to these
functions, the first paramater type is overloaded to allow the caller to
specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
are valid, but do not check the memory that they point to. This is used
by process_vm_readv/writev where we need to validate that a iovec passed
to the syscall is valid but do not want to check the memory that it points
to at this point because it refers to an address space in another process.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a70b52ec 21-May-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: make AIO use the proper rw_verify_area() area helpers

We had for some reason overlooked the AIO interface, and it didn't use
the proper rw_verify_area() helper function that checks (for example)
mandatory locking on the file, and that the size of the access doesn't
cause us to overflow the provided offset limits etc.

Instead, AIO did just the security_file_permission() thing (that
rw_verify_area() also does) directly.

This fixes it to do all the proper helper functions, which not only
means that now mandatory file locking works with AIO too, we can
actually remove lines of code.

Reported-by: Manish Honap <manish_honap_vit@yahoo.co.in>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfce281c 20-Apr-2012 Al Viro <viro@zeniv.linux.org.uk>

kill mm argument of vm_munmap()

it's always current->mm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 936af157 20-Apr-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: don't bother with unmapping when aio_free_ring() is coming from exit_aio()

... since exit_mmap() is coming and it will munmap() everything anyway.
In all other cases aio_free_ring() has ctx->mm == current->mm; moreover,
all other callers of vm_munmap() have mm == current->mm, so this will
allow us to get rid of mm argument of vm_munmap().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a46ef99d 20-Apr-2012 Linus Torvalds <torvalds@linux-foundation.org>

VM: add "vm_munmap()" helper function

Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2e1859a 20-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: take final put_ioctx() into callers of io_destroy()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 06af121e 20-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: merge aio_cancel_all() with wait_for_all_aios()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9fcf03d0 13-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: fix the comment in aio_kick_handler()

It should've been changed when queue_work() became
queue_delayed_work(..., 0) in there. It's always had been
about not needing a delay, not about not using specific
function...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cd1ea261 10-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: don't bother with cancel_delayed_work() in exit_aio()

__put_ioctx() will cover it anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bf50722a 10-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: use cancel_delayed_work_sync()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9fa1cb39 10-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: aio_nr_lock is taken only synchronously now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2dd542b7 10-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: aio_nr decrements don't need to be delayed

we can do that right in __put_ioctx(); as the result, the loop
in ioctx_alloc() can be killed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e23754f8 06-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

aio: don't bother with async freeing on failure in ioctx_alloc()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e8e3c3d6 25-Nov-2011 Cong Wang <amwang@redhat.com>

fs: remove the second argument of k[un]map_atomic()

Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Cong Wang <amwang@redhat.com>


# c7b28555 08-Mar-2012 Al Viro <viro@ZenIV.linux.org.uk>

aio: fix the "too late munmap()" race

Current code has put_ioctx() called asynchronously from aio_fput_routine();
that's done *after* we have killed the request that used to pin ioctx,
so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
from progressing. As the result, we can end up with async call of
put_ioctx() being the last one and possibly happening during exit_mmap()
or elf_core_dump(), neither of which expects stray munmap() being done
to them...

We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
with that, but that's all we care about - neither io_destroy() nor
exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
does really_put_req(), so the ioctx teardown won't be done until then
and we don't care about the contents of ioctx past that point.

Since actual freeing of these suckers is RCU-delayed, we don't need to
bump ioctx refcount when request goes into list for async removal.
All we need is rcu_read_lock held just over the ->ctx_lock-protected
area in aio_fput_routine().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 86b62a2c 06-Mar-2012 Al Viro <viro@ZenIV.linux.org.uk>

aio: fix io_setup/io_destroy race

Have ioctx_alloc() return an extra reference, so that caller would drop it
on success and not bother with re-grabbing it on failure exit. The current
code is obviously broken - io_destroy() from another thread that managed
to guess the address io_setup() would've returned would free ioctx right
under us; gets especially interesting if aio_context_t * we pass to
io_setup() points to PROT_READ mapping, so put_user() fails and we end
up doing io_destroy() on kioctx another thread has just got freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 880641bb 05-Mar-2012 Jeff Moyer <jmoyer@redhat.com>

aio: wake up waiters when freeing unused kiocbs

Bart Van Assche reported a hung fio process when either hot-removing
storage or when interrupting the fio process itself. The (pruned) call
trace for the latter looks like so:

fio D 0000000000000001 0 6849 6848 0x00000004
ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
Call Trace:
schedule+0x3f/0x60
io_schedule+0x8f/0xd0
wait_for_all_aios+0xc0/0x100
exit_aio+0x55/0xc0
mmput+0x2d/0x110
exit_mm+0x10d/0x130
do_exit+0x671/0x860
do_group_exit+0x44/0xb0
get_signal_to_deliver+0x218/0x5a0
do_signal+0x65/0x700
do_notify_resume+0x65/0x80
int_signal+0x12/0x17

The problem lies with the allocation batching code. It will
opportunistically allocate kiocbs, and then trim back the list of iocbs
when there is not enough room in the completion ring to hold all of the
events.

In the case above, what happens is that the pruning back of events ends
up freeing up the last active request and the context is marked as dead,
so it is thus responsible for waking up waiters. Unfortunately, the
code does not check for this condition, so we end up with a hung task.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@kernel.org> [3.2.x only]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 630d9c47 16-Nov-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

fs: reduce the use of module.h wherever possible

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 69e4747e 08-Jan-2012 Gleb Natapov <gleb@redhat.com>

Unused iocbs in a batch should not be accounted as active.

Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
allocated in a batch during processing of first iocbs. All iocbs in a
batch are automatically added to ctx->active_reqs list and accounted in
ctx->reqs_active.

If one (not the last one) of iocbs submitted by an user fails, further
iocbs are not processed, but they are still present in ctx->active_reqs
and accounted in ctx->reqs_active. This causes process to stuck in a D
state in wait_for_all_aios() on exit since ctx->reqs_active will never
go down to zero. Furthermore since kiocb_batch_free() frees iocb
without removing it from active_reqs list the list become corrupted
which may cause oops.

Fix this by removing iocb from ctx->active_reqs and updating
ctx->reqs_active in kiocb_batch_free().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@kernel.org # 3.2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 080d676d 02-Nov-2011 Jeff Moyer <jmoyer@redhat.com>

aio: allocate kiocbs in batches

In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path. The reason
is that we take it for every I/O submitted (see __aio_get_req). Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.

In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by .56% (average of 3 runs). The command I used to
test this was:

aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fcf63409 31-Oct-2011 Christopher Yeoh <cyeoh@au1.ibm.com>

Cross Memory Attach

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e91f90bb 22-Mar-2011 Roland Dreier <roland@purestorage.com>

aio: wake all waiters when destroying ctx

The test program below will hang because io_getevents() uses
add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
wakes up one of the threads. Fix this by using wake_up_all() in the aio
code paths where we want to make sure no one gets stuck.

// t.c -- compile with gcc -lpthread -laio t.c

#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static const int nthr = 2;

void *getev(void *ctx)
{
struct io_event ev;
io_getevents(ctx, 1, 1, &ev, NULL);
printf("io_getevents returned\n");
return NULL;
}

int main(int argc, char *argv[])
{
io_context_t ctx = 0;
pthread_t thread[nthr];
int i;

io_setup(1024, &ctx);

for (i = 0; i < nthr; ++i)
pthread_create(&thread[i], NULL, getev, ctx);

sleep(1);

io_destroy(ctx);

for (i = 0; i < nthr; ++i)
pthread_join(thread[i], NULL);

return 0;
}

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf15900e 02-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

aio: remove request submission batching

This should be useless now that we have on-stack plugging. So lets just
kill it.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 9f5b9425 30-Jun-2010 Shaohua Li <shaohua.li@intel.com>

fs: make aio plug

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7eaceacc 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7137c6bd 25-Feb-2011 Jan Kara <jack@suse.cz>

aio: fix race between io_destroy() and io_submit()

A race can occur when io_submit() races with io_destroy():

CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.

Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3bd9a5d7 25-Feb-2011 Nick Piggin <npiggin@gmail.com>

aio: fix rcu ioctx lookup

aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.

Fix the bug by atomically testing for zero refcount before incrementing.

[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d37adaa1 26-Jan-2011 Tejun Heo <tj@kernel.org>

fs/aio: aio_wq isn't used in memory reclaim path

aio_wq isn't used during memory reclaim. Convert to alloc_workqueue()
without WQ_MEM_RECLAIM. It's possible to use system_wq but given that
the number of work items is determined from userland and the work item
may block, enforcing strict concurrency limit would be a good idea.

Also, move fput_work to system_wq so that aio_wq is used soley to
throttle the max concurrency of aio work items and fput_work doesn't
interact with other work items.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org


# 27eaa1c9 13-Dec-2010 Namhyung Kim <namhyung@gmail.com>

aio: check return value of create_workqueue()

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d3486f8b 12-Jan-2011 Jeff Moyer <jmoyer@redhat.com>

aio: remove unused aio_run_iocbs()

aio_run_iocbs() is not used at all, so get rid of it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2e410255 12-Jan-2011 Namhyung Kim <namhyung@gmail.com>

aio: remove unnecessary check

'nr >= min_nr >= 0' always satisfies 'nr >= 0' so the check is unnecesary.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7de9c6ee 23-Oct-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: ihold()

Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 306fb097 23-Aug-2010 Chris Mason <chris.mason@oracle.com>

aio: bump i_count instead of using igrab

The aio batching code is using igrab to get an extra reference on the
inode so it can safely batch. igrab will go ahead and take the global
inode spinlock, which can be a bottleneck on large machines doing lots
of AIO.

In this case, igrab isn't required because we already have a reference
on the file handle. It is safe to just bump the i_count directly
on the inode.

Benchmarking shows this patch brings IOP/s on tons of flash up by about
2.5X.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a0c42bac 22-Sep-2010 Jan Kara <jack@suse.cz>

aio: do not return ERESTARTSYS as a result of AIO

OCFS2 can return ERESTARTSYS from its write function when the process is
signalled while waiting for a cluster lock (and the filesystem is mounted
with intr mount option). Generally, it seems reasonable to allow
filesystems to return this error code from its IO functions. As we must
not leak ERESTARTSYS (and similar error codes) to userspace as a result of
an AIO operation, we have to properly convert it to EINTR inside AIO code
(restarting the syscall isn't really an option because other AIO could
have been already submitted by the same io_submit syscall).

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 75e1c70f 10-Sep-2010 Jeff Moyer <jmoyer@redhat.com>

aio: check for multiplication overflow in do_io_submit

Tavis Ormandy pointed out that do_io_submit does not do proper bounds
checking on the passed-in iocb array:

       if (unlikely(nr < 0))
               return -EINVAL;

       if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
               return -EFAULT;                      ^^^^^^^^^^^^^^^^^^

The attached patch checks for overflow, and if it is detected, the
number of iocbs submitted is scaled down to a number that will fit in
the long.  This is an ok thing to do, as sys_io_submit is documented as
returning the number of iocbs submitted, so callers should handle a
return value of less than the 'nr' argument passed in.

Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 642b5123 05-Aug-2010 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>

aio: fix wrong subsystem comments

- sys_io_destroy(): acutually return -EINVAL if the context pointed to
is invalidIndex: linux-2.6.33-rc4/fs/aio.c
- sys_io_getevents(): An argument specifying timeout is not `when',
but `timeout'.
- sys_io_getevents(): Should describe what is returned if this syscall
succeeds.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d7065da0 26-May-2010 Al Viro <viro@zeniv.linux.org.uk>

get rid of the magic around f_count in aio

__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9d85cba7 26-May-2010 Jeff Moyer <jmoyer@redhat.com>

aio: fix the compat vectored operations

The aio compat code was not converting the struct iovecs from 32bit to
64bit pointers, causing either EINVAL to be returned from io_getevents, or
EFAULT as the result of the I/O. This patch passes a compat flag to
io_submit to signal that pointer conversion is necessary for a given iocb
array.

A variant of this was tested by Michael Tokarev. I have also updated the
libaio test harness to exercise this code path with good success.
Further, I grabbed a copy of ltp and ran the
testcases/kernel/syscall/readv and writev tests there (compiled with -m32
on my 64bit system). All seems happy, but extra eyes on this would be
welcome.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org> [2.6.35.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fac046ad 15-Dec-2009 Shaohua Li <shaohua.li@intel.com>

aio: remove unused field

Don't know the reason, but it appears ki_wait field of iocb never gets used.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b9d128f1 29-Oct-2009 Jens Axboe <jens.axboe@oracle.com>

block: move bdi/address_space unplug functions to backing-dev.h

There's nothing block related about them, the backing device
is used by things like NFS etc as well. This gets rid of the
need to protect such calls by CONFIG_BLOCK.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# cfb1e33e 02-Oct-2009 Jeff Moyer <jmoyer@redhat.com>

aio: implement request batching

Hi,

Some workloads issue batches of small I/O, and the performance is poor
due to the call to blk_run_address_space for every single iocb. Nathan
Roberts pointed this out, and suggested that by deferring this call
until all I/Os in the iocb array are submitted to the block layer, we
can realize some impressive performance gains (up to 30% for sequential
4k reads in batches of 16).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 385773e0 22-Sep-2009 H Hartley Sweeten <hartleys@visionengravers.com>

aio.c: move EXPORT* macros to line after function

As mentioned in Documentation/CodingStyle, move EXPORT* macro's
to the line immediately after the closing function brace line.

Also, move the __initcall() similarly.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d2d827f 21-Sep-2009 Michael S. Tsirkin <mst@redhat.com>

mm: move use_mm/unuse_mm from aio.c to mm/

Anyone who wants to do copy to/from user from a kernel thread, needs
use_mm (like what fs/aio has). Move that into mm/, to make reusing and
exporting easier down the line, and make aio use it. Next intended user,
besides aio, will be vhost-net.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 13389010 30-Jun-2009 Davide Libenzi <davidel@xmailserver.org>

eventfd: revised interface and cleanups

Change the eventfd interface to de-couple the eventfd memory context, from
the file pointer instance.

Without such change, there is no clean way to racely free handle the
POLLHUP event sent when the last instance of the file* goes away. Also,
now the internal eventfd APIs are using the eventfd context instead of the
file*.

This patch is required by KVM's IRQfd code, which is still under
development.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 65c24491 18-Mar-2009 Jeff Moyer <jmoyer@redhat.com>

aio: lookup_ioctx can return the wrong value when looking up a bogus context

The libaio test harness turned up a problem whereby lookup_ioctx on a
bogus io context was returning the 1 valid io context from the list
(harness/cases/3.p).

Because of that, an extra put_iocontext was done, and when the process
exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio
(since we expect a users count of 1 and instead get 0).

The problem was introduced by "aio: make the lookup_ioctx() lockless"
(commit abf137dd7712132ee56d5b3143c2ff61a72a5faa).

Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
return with a NULL tpos at the end of the loop, even if the entry was
not found.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87c3a86e 18-Mar-2009 Davide Libenzi <davidel@xmailserver.org>

eventfd: remove fput() call from possible IRQ context

Remove a source of fput() call from inside IRQ context. Myself, like Eric,
wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was
able to, with the attached test program. Independently from this, the bug is
conceptually there, so we might be better off fixing it. This patch adds an
optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd.
Playing with ->f_count directly is not pretty in general, but the alternative
here would be to add a brand new delayed fput() infrastructure, that I'm not
sure is worth it.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 002c8976 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 16

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# abf137dd 09-Dec-2008 Jens Axboe <jens.axboe@oracle.com>

aio: make the lookup_ioctx() lockless

The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.

There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 516e0cc5 25-Jul-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] f_count may wrap around

make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 246bb0b1 25-Jul-2008 Oleg Nesterov <oleg@tv-sign.ru>

kill PF_BORROWED_MM in favour of PF_KTHREAD

Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

No functional changes yet. But this allows us to do further
fixes/cleanups.

oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm(). The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races. With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:

/* The result must not be dereferenced !!! */
struct mm_struct *__get_task_mm(struct task_struct *tsk)
{
if (tsk->flags & PF_KTHREAD)
return NULL;
return tsk->mm;
}

Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aab2545f 06-Jun-2008 Oleg Nesterov <oleg@tv-sign.ru>

uml: activate_mm: remove the dead PF_BORROWED_MM check

use_mm() was changed to use switch_mm() instead of activate_mm(), since
then nobody calls (and nobody should call) activate_mm() with
PF_BORROWED_MM bit set.

As Jeff Dike pointed out, we can also remove the "old != new" check, it is
always true.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c6f3a97f 30-Apr-2008 Thomas Gleixner <tglx@linutronix.de>

debugobjects: add timer specific object debugging code

Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 39fa0031 29-Apr-2008 Jeff Moyer <jmoyer@redhat.com>

aio: fix misleading comments

The FIXME comments are inaccurate.
The locking comment over lookup_ioctx() is wrong.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 801678c5 29-Apr-2008 Hirofumi Nakagawa <hnakagawa@miraclelinux.com>

Remove duplicated unlikely() in IS_ERR()

Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.

This patch cleans up such pointless code.

Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d5470b59 29-Apr-2008 Adrian Bunk <bunk@kernel.org>

fs/aio.c: make 3 functions static

Make the following needlessly global functions static:

- __put_ioctx()
- lookup_ioctx()
- io_submit_one()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e92adcba 28-Apr-2008 Jeff Moyer <jmoyer@redhat.com>

aio: io_getevents() should return if io_destroy() is invoked

This patch wakes up a thread waiting in io_getevents if another thread
destroys the context. This was tested using a small program that spawns a
thread to wait in io_getevents while the parent thread destroys the io context
and then waits for the getevents thread to exit. Without this patch, the
program hangs indefinitely. With the patch, the program exits as expected.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Christopher Smith <x@xman.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d1c98b0 10-Apr-2008 Davide Libenzi <davidel@xmailserver.org>

eventfd/kaio integration fix

Jeff Roberson discovered a race when using kaio eventfd based notifications.
When it occurs it can lead tomissed wakeups and hung userspace.

This patch fixes the race by moving the notification inside the spinlocked
section of kaio. The operation is safe since eventfd spinlock and kaio one
are unrelated.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jeff Roberson <jroberson@chesapeake.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 598af051 10-Apr-2008 Roland McGrath <roland@redhat.com>

asmlinkage_protect sys_io_getevents

Use asmlinkage_protect in sys_io_getevents, because GCC for i386 with
CONFIG_FRAME_POINTER=n can decide to clobber an argument word on the
stack, i.e. the user struct pt_regs. Here the problem is not a tail
call, but just the compiler's use of the stack when it inlines and
optimizes the body of the called function. This seems to avoid it.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6cb2a210 19-Mar-2008 Quentin Barnes <qbarnes+linux@yahoo-inc.com>

aio: bad AIO race in aio_complete() leads to process hang

My group ran into a AIO process hang on a 2.6.24 kernel with the process
sleeping indefinitely in io_getevents(2) waiting for the last wakeup to come
and it never would.

We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box
("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache isn't
shared between all eight processors, but is L2 is shared between between all
two processors on the Core2Duo we use.

My analysis of the hang is if you go down to the second while-loop
in read_events(), what happens on processor #1:
1) add_wait_queue_exclusive() adds thread to ctx->wait
2) aio_read_evt() to check tail
3) if aio_read_evt() returned 0, call [io_]schedule() and sleep

In aio_complete() with processor #2:
A) info->tail = tail;
B) waitqueue_active(&ctx->wait)
C) if waitqueue_active() returned non-0, call wake_up()

The way the code is written, step 1 must be seen by all other processors
before processor 1 checks for pending events in step 2 (that were recorded by
step A) and step A by processor 2 must be seen by all other processors
(checked in step 2) before step B is done.

The race I believed I was seeing is that steps 1 and 2 were
effectively swapped due to the __list_add() being delayed by the L2
cache not shared by some of the other processors. Imagine:
proc 2: just before step A
proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
proc 1, step 2: checks tail and sees no pending events
proc 2, step A: updates tail
proc 1, step 3: calls [io_]schedule() and sleeps
proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
so proc 1 sleeps indefinitely

My patch adds a memory barrier between steps A and B. It ensures that the
update in step 1 gets seen on processor 2 before continuing. If processor 1
was just before step 1, the memory barrier makes sure that step A (update
tail) gets seen by the time processor 1 makes it to step 2 (check tail).

Before the patch our AIO process would hang virtually 100% of the time. After
the patch, we have yet to see the process ever hang.

Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
Reviewed-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ We should probably disallow that "if (waitqueue_active()) wake_up()"
coding pattern, because it's so often buggy wrt memory ordering ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c2ec6682 08-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

aio: negative offset should return -EINVAL

An AIO read or write should return -EINVAL if the offset is negative.
This check matches the one in pread and pwrite.

This was found by the libaio test suite.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7adfa2ff 08-Feb-2008 Rusty Russell <rusty@rustcorp.com.au>

aio: partial write should not return error code

When an AIO write gets an error after writing some data (eg. ENOSPC), it
should return the amount written already, not the error. Just like write()
is supposed to.

This was found by the libaio test suite.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-By: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fc9b52cd 08-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com>

fs: remove fastcall, it is always empty

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 56c4da45 30-Jan-2008 Harvey Harrison <harvey.harrison@gmail.com>

core: remove last users of empty FASTCALL macro

FASTCALL is always empty after the x86 removal.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# e00ba3da 05-Dec-2007 Jeff Moyer <jmoyer@redhat.com>

aio: only account I/O wait time in read_events if there are active requests

On 2.6.24, top started showing 100% iowait on one CPU when a UML instance was
running (but completely idle). The UML code sits in io_getevents waiting for
an event to be submitted and completed.

Fix this by checking ctx->reqs_active before scheduling to determine whether
or not we are waiting for I/O.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6212e3a3 18-Oct-2007 Alexey Dobriyan <adobriyan@gmail.com>

Remove struct task_struct::io_wait

Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
during 2.6.9 development. Commit introduced io_wait field which remained
write-only than and still remains write-only.

Also garbage collect macros which "use" io_wait.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 41d10da3 17-Oct-2007 Jeff Moyer <jmoyer@redhat.com>

aio: account I/O wait time properly

Some months back I proposed changing the schedule() call in
read_events to an io_schedule():
http://osdir.com/ml/linux.kernel.aio.general/2006-10/msg00024.html
This was rejected as there are AIO operations that do not initiate
disk I/O. I've had another look at the problem, and the only AIO
operation that will not initiate disk I/O is IOCB_CMD_NOOP. However,
this command isn't even wired up!

Given that it doesn't work, and hasn't for *years*, I'm going to
suggest again that we do proper I/O accounting when using AIO.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 87e2831c 08-Oct-2007 Yan Zheng <yanzheng@21cn.com>

AIO: fix cleanup in io_submit_one(...)

When IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect,
statement 'goto out_put_req' is executed. At label 'out_put_req',
aio_put_req(..) is called, which requires 'req->ki_filp' set.

Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9c3060be 10-May-2007 Davide Libenzi <davidel@xmailserver.org>

signal/timer/event: KAIO eventfd support example

This is an example about how to add eventfd support to the current KAIO code,
in order to enable KAIO to post readiness events to a pollable fd (hence
compatible with POSIX select/poll). The KAIO code simply signals the eventfd
fd when events are ready, and this triggers a POLLIN in the fd. This patch
uses a reserved for future use member of the struct iocb to pass an eventfd
file descriptor, that KAIO will use to post events every time a request
completes. At that point, an aio_getevents() will return the completed result
to a struct io_event. I made a quick test program to verify the patch, and it
runs fine here:

http://www.xmailserver.org/eventfd-aio-test.c

The test program uses poll(2), but it'd, of course, work with select and epoll
too.

This can allow to schedule both block I/O and other poll-able devices
requests, and wait for results using select/poll/epoll. In a typical
scenario, an application would submit KAIO request using aio_submit(), and
will also use epoll_ctl() on the whole other class of devices (that with the
addition of signals, timers and user events, now it's pretty much complete),
and then would:

epoll_wait(...);
for_each_event {
if (curr_event_is_kaiofd) {
aio_getevents();
dispatch_aio_events();
} else {
dispatch_epoll_event();
}
}

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28e53bdd 09-May-2007 Oleg Nesterov <oleg@tv-sign.ru>

unify flush_work/flush_work_keventd and rename it to cancel_work_sync

flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this). So we can unify
flush_work_keventd and flush_work.

Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.

(akpm: this is why the earlier patches bypassed maintainers)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a9df62c7 09-May-2007 Andrew Morton <akpm@osdl.org>

aio: use flush_work()

Migrate AIO over to use flush_work().

Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a31bd5f 06-May-2007 Christoph Lameter <clameter@sgi.com>

KMEM_CACHE(): simplify slab cache creation

This patch provides a new macro

KMEM_CACHE(<struct>, <flags>)

to simplify slab creation. KMEM_CACHE creates a slab with the name of the
struct, with the size of the struct and with the alignment of the struct.
Additional slab flags may be specified if necessary.

Example

struct test_slab {
int a,b,c;
struct list_head;
} __cacheline_aligned_in_smp;

test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

will create a new slab named "test_slab" of the size sizeof(struct
test_slab) and aligned to the alignment of test slab. If it fails then we
panic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 28defbea 27-Mar-2007 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: remove bare user-triggerable error printk

The user can generate console output if they cause do_mmap() to fail
during sys_io_setup(). This was seen in a regression test that does
exactly that by spinning calling mmap() until it gets -ENOMEM before
calling io_setup().

We don't need this printk at all, just remove it.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c3762229 10-Feb-2007 Robert P. J. Day <rpjday@mindspring.com>

[PATCH] Transform kmem_cache_alloc()+memset(0) -> kmem_cache_zalloc().

Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e10a4437 10-Feb-2007 Robert P. J. Day <rpjday@mindspring.com>

[PATCH] Remove final references to deprecated "MAP_ANON" page protection flag

Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dee11c23 03-Feb-2007 Ken Chen <kenchen@google.com>

[PATCH] aio: fix buggy put_ioctx call in aio_complete - v2

An AIO bug was reported that sleeping function is being called in softirq
context:

BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
Call Trace:
[<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
[<a000000100577ba0>] mutex_lock+0x20/0x40
[<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
[<a00000010018c0c0>] __put_ioctx+0xc0/0x240
[<a00000010018d470>] aio_complete+0x2f0/0x420
[<a00000010019cc80>] finished_one_bio+0x200/0x2a0
[<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
[<a00000010019d260>] dio_bio_end_aio+0x60/0x80
[<a00000010014acd0>] bio_endio+0x110/0x1c0
[<a0000001002770e0>] __end_that_request_first+0x180/0xba0
[<a000000100277b90>] end_that_request_chunk+0x30/0x60
[<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
[<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
[<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
[<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
[<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
[<a000000100277d20>] blk_done_softirq+0x160/0x1c0
[<a000000100083e00>] __do_softirq+0x200/0x240
[<a000000100083eb0>] do_softirq+0x70/0xc0

See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in the softirq context.
However, aio_complete() called from I/O interrupt can potentially call
put_ioctx with last ref count on ioctx and triggers bug. It is simply
incorrect to perform ioctx freeing from aio_complete.

The bug is trigger-able from a race between io_destroy() and aio_complete().
A possible scenario:

cpu0 cpu1
io_destroy aio_complete
wait_for_all_aios { __aio_put_req
... ctx->reqs_active--;
if (!ctx->reqs_active)
return;
}
...
put_ioctx(ioctx)

put_ioctx(ctx);
__put_ioctx
bam! Bug trigger!

The real problem is that the condition check of ctx->reqs_active in
wait_for_all_aios() is incorrect that access to reqs_active is not
being properly protected by spin lock.

This patch adds that protective spin lock, and at the same time removes
all duplicate ref counting for each kiocb as reqs_active is already used
as a ref count for each active ioctx. This also ensures that buggy call
to flush_workqueue() in softirq context is eliminated.

Signed-off-by: "Ken Chen" <kenchen@google.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: <stable@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ebb1101 29-Dec-2006 Zach Brown <zach.brown@oracle.com>

[PATCH] Fix lock inversion aio_kick_handler()

lockdep found a AB BC CA lock inversion in retry-based AIO:

1) The task struct's alloc_lock (A) is acquired in process context with
interrupts enabled. An interrupt might arrive and call wake_up() which
grabs the wait queue's q->lock (B).

2) When performing retry-based AIO the AIO core registers
aio_wake_function() as the wake funtion for iocb->ki_wait. It is called
with the wait queue's q->lock (B) held and then tries to add the iocb to
the run list after acquiring the ctx_lock (C).

3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
alloc_lock (A) via lock_task() and unuse_mm(). Lockdep emits a warning
saying that we're trying to connect the irq-safe q->lock to the
irq-unsafe alloc_lock via ctx_lock.

This fixes the inversion by calling unuse_mm() in the AIO kick handing path
after we've released the ctx_lock. As Ben LaHaise pointed out __put_ioctx
could set ctx->mm to NULL, so we must only access ctx->mm while we have the
lock.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 90aef12e 13-Dec-2006 Jeremy Fitzhardinge <jeremy@goop.org>

[PATCH] Use activate_mm() in fs/aio.c:use_mm()

activate_mm() is not the right thing to be using in use_mm(). It should be
switch_mm().

On normal x86, they're synonymous, but for the Xen patches I'm adding a
hook which assumes that activate_mm is only used the first time a new mm
is used after creation (I have another hook for dealing with dup_mm). I
think this use of activate_mm() is the only place where it could be used
a second time on an mm.

>From a quick look at the other architectures I think this is OK (most
simply implement one in terms of the other), but some are doing some
subtly different stuff between the two.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 97d2a805 06-Dec-2006 Benjamin LaHaise <bcrl@kvack.org>

[PATCH] aio: remove ki_retried debugging member

Remove the ki_retried member from struct kiocb. I think the idea was
bounced around a while back, but Arnaldo pointed out another reason that we
should dig it up when he pointed out that the last cacheline of struct
kiocb only contains 4 bytes. By removing the debugging member, we save
more than the 8 byte on 64 bit machines.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b62e8ec2 06-Dec-2006 Kenneth W Chen <kenneth.w.chen@intel.com>

[PATCH] aio: kill pointless ki_nbytes assignment in aio_setup_single_vector

io_submit_one assigns ki_left = ki_nbytes = iocb->aio_nbytes, then calls
down to aio_setup_iocb, then to aio_setup_single_vector. In there,
ki_nbytes is reassigned to the same value it got two call stack above it.
There is no need to do so.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e18b890b 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 93e06b41 29-Nov-2006 Eric Sesterhenn <snakebyte@gmx.de>

BUG_ON conversion for fs/aio.c

This patch converts a if () BUG(); construct to BUG_ON();
which occupies less space, uses unlikely and is safer when
BUG() is disabled.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 65f27f38 22-Nov-2006 David Howells <dhowells@redhat.com>

WorkStruct: Pass the work_struct pointer instead of context data

Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>


# 52bad64d 22-Nov-2006 David Howells <dhowells@redhat.com>

WorkStruct: Separate delayable and non-delayable events.

Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness. On a 64-bit
architecture it's nearly 100 bytes in size. This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>


# 691578cd 03-Oct-2006 Zach Brown <zach.brown@oracle.com>

[PATCH] pr_debug: aio: use size_t length modifier in pr_debug format arguments

aio: use size_t length modifier in pr_debug format arguments

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# eed4e51f 01-Oct-2006 Badari Pulavarty <pbadari@us.ibm.com>

[PATCH] Add vector AIO support

This work is initially done by Zach Brown to add support for vectored aio.
These are the core changes for AIO to support
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

[akpm@osdl.org: huge build fix]
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 027445c3 01-Oct-2006 Badari Pulavarty <pbadari@us.ibm.com>

[PATCH] Vectorize aio_read/aio_write fileop methods

This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d6e05edc 26-Jun-2006 Andreas Mohr <andi@lisas.de>

spelling fixes

acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 626ab0e6 23-Jun-2006 Oleg Nesterov <oleg@tv-sign.ru>

[PATCH] list: use list_replace_init() instead of list_splice_init()

list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1. We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 11b0b5ab 25-Mar-2006 Oliver Neukum <neukum@fachschaft.cup.uni-muenchen.de>

[PATCH] use kzalloc and kcalloc in core fs code

Signed-off-by: Oliver Neukum <oliver@neukum.name>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 095975da 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] rcu file: use atomic primitives

Use atomic_inc_not_zero for rcu files instead of special case rcuref.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d00689af 13-Nov-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: replace locking comments with assert_spin_locked()

aio: replace locking comments with assert_spin_locked()

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 20dcae32 13-Nov-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: remove kioctx from mm_struct

Sync iocbs have a life cycle that don't need a kioctx. Their retrying, if
any, is done in the context of their owner who has allocated them on the
stack.

The sole user of a sync iocb's ctx reference was aio_complete() checking for
an elevated iocb ref count that could never happen. No path which grabs an
iocb ref has access to sync iocbs.

If we were to implement sync iocb cancelation it would be done by the owner of
the iocb using its on-stack reference.

Removing this chunk from aio_complete allows us to remove the entire kioctx
instance from mm_struct, reducing its size by a third. On a i386 testing box
the slab size went from 768 to 504 bytes and from 5 to 8 per page.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# d55b5fda 07-Nov-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: remove aio_max_nr accounting race

AIO was adding a new context's max requests to the global total before
testing if that resulting total was over the global limit. This let
innocent tasks get their new limit tested along with a racing guilty task
that was crossing the limit. This serializes the _nr accounting with a
spinlock It also switches to using unsigned long for the global totals.
Individual contexts are still limited to an unsigned int's worth of
requests by the syscall interface.

The problem and fix were verified with a simple program that spun creating
and destroying a context while holding on to another long lived context.
Before the patch a task creating a tiny context could get a spurious EAGAIN
if it raced with a task creating a very large context that overran the
limit.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8766ce41 23-Oct-2005 Kostik Belousov <konstantin.belousov@zoral.com.ua>

[PATCH] aio syscalls are not checked by lsm

Another case of missing call to security_file_permission: aio functions
(namely, io_submit) does not check credentials with security modules.

Below is the simple patch to the problem. It seems that it is enough to
check for rights at the request submission time.

Signed-off-by: Kostik Belousov <kostikbel@gmail.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4faa5285 17-Oct-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: revert lock_kiocb()

lock_kiocb() was introduced to serialize retrying and cancellation. In the
process of doing so it tried to sleep waiting for KIF_LOCKED while holding
the ctx_lock spinlock. Recent fixes have ensured that multiple concurrent
retries won't be attempted for a given iocb. Cancel has other problems and
has no significant in-tree users that have been complaining about it. So
for the immediate future we'll revert sleeping with the lock held and will
address proper cancellation and retry serialization in the future.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 353fb07e 30-Sep-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: avoid extra aio_{read,write} call when ki_left == 0

Recently aio_p{read,write} changed to perform retries internally rather
than returning -EIOCBRETRY. This inadvertantly resulted in always calling
aio_{read,write} with ki_left at 0 which would in turn immediately return
0. Harmless, but we can avoid this call by checking in the caller.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 897f15fb 30-Sep-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: remove unlocked task_list test and resulting race

Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick path could delete the task_list item from the wait
queue before getting the ctx_lock and putting the iocb on the run list.
The run path was testing the task_list item outside the lock so that it
could catch ki_retry methods that return -EIOCBRETRY *without* putting the
iocb on a wait queue and promising to call kick_iocb. This unlocked check
could then race with the kick path to cause both to try and put the iocb on
the run list.

The patch stops the run path from testing task_list by requring that any
ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be
called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY
users, are updated.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 998765e5 30-Sep-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] aio: lock around kiocbTryKick()

Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick patch could set the Kicked bit before acquiring the
ctx_lock and putting the iocb on the run list. The run path, while holding
the ctx_lock, could see this partial kick and mistake it for a kick that
was deferred while it was doing work with the run_list NULLed out. It
would then race with the kick thread to add the iocb to the run list.

This patch moves the kick setting under the ctx_lock so that only one of
the kick or run path queues the iocb on the run list, as intended.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a464adeb 16-Sep-2005 Zach Brown <zach.brown@oracle.com>

[PATCH] Add smp_mb__after_clear_bit() to unlock_kiocb()

Add smp_mb__after_clear_bit() to unlock_kiocb()

AIO's use of wait_on_bit_lock()/wake_up_bit() forgot to add a barrier
between clearing its lock bit and calling wake_up_bit() so wake_up_bit()'s
unlocked waitqueue_active() can race. This puts AIO's use in line with the
others and the comment above wake_up_bit().

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ab2af1f5 09-Sep-2005 Dipankar Sarma <dipankar@in.ibm.com>

[PATCH] files: files struct with RCU

Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure. The fdtable structure is
protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead. This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ac0b1bc1 09-Sep-2005 Benjamin LaHaise <bcrl@linux.intel.com>

[PATCH] aio: kiocb locking to serialise retry and cancel

Implement a per-kiocb lock to serialise retry operations and cancel. This
is done using wait_on_bit_lock() on the KIF_LOCKED bit of kiocb->ki_flags.
Also, make the cancellation path lock the kiocb and subsequently release
all references to it if the cancel was successful. This version includes a
fix for the deadlock with __aio_run_iocbs.

Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8f58202b 09-Sep-2005 Wendy Cheng <wcheng@redhat.com>

[PATCH] change io_cancel return code for no cancel case

Note that other than few exceptions, most of the current filesystem and/or
drivers do not have aio cancel specifically defined (kiob->ki_cancel field
is mostly NULL). However, sys_io_cancel system call universally sets
return code to -EAGAIN. This gives applications a wrong impression that
this call is implemented but just never works. We have customer inquires
about this issue.

Changed by Benjamin LaHaise to EINVAL instead of ENOSYS

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1e40cd38 03-Sep-2005 Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>

[PATCH] uml: fixes performance regression in activate_mm and thus exec()

Normally, activate_mm() is called from exec(), and thus it used to be a
no-op because we use a completely new "MM context" on the host (for
instance, a new process), and so we didn't need to flush any "TLB entries"
(which for us are the set of memory mappings for the host process from the
virtual "RAM" file).

Kernel threads, instead, are usually handled in a different way. So, when
for AIO we call use_mm(), things used to break and so Benjamin implemented
activate_mm(). However, that is only needed for AIO, and could slow down
exec() inside UML, so be smart: detect being called for AIO (via
PF_BORROWED_MM) and do the full flush only in that situation.

Comment also the caller so that people won't go breaking UML without
noticing. I also rely on the caller's locks for testing current->flags.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c016e225 28-Jun-2005 S�bastien Dugu <sebastien.dugue@bull.net>

[PATCH] aio-retry-fix: fix aio retry work queueing

In the case of buffered AIO, in the aio retry path (aio_run_iocb), when the
retry method returns EIOCBRETRY the kicked iocb is added to the context run
list but is never queued onto the work queue. The request therefore is
never completed.

This patch fixes that by adding the appropriate call to aio_queue_work in
aio_run_aiocb so that subsequent retries will be handled by the aio worker
thread.

Signed-off-by: S�bastien Dugu� <sebastien.dugue@bull.net>
Acked-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 954d3e95 01-May-2005 Kenneth W Chen <kenneth.w.chen@intel.com>

[PATCH] aio: optimize io_submit_one()

This patch optimizes io_submit_one to call aio_run_iocb() directly if
ctx->run_list is empty. When the list is empty, the operation of adding to
the list, then call to __aio_run_iocbs() is unnecessary because these
operations are done in one atomic step. ctx->run_list always has only one
element in this case. This optimization speeds up industry standard db
transaction processing benchmark by 0.2%.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 644d3a08 01-May-2005 Kenneth W Chen <kenneth.w.chen@intel.com>

[PATCH] aio: clean up debug code

Clean up code that was previously used for debug purpose. Remove aio_run,
aio_wakeups, iocb->ki_queued and iocb->ki_kicked. Also clean up unused
variable count in __aio_run_iocbs() and debug code in read_events().

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4bf69b2a 01-May-2005 Kenneth W Chen <kenneth.w.chen@intel.com>

[PATCH] aio: ring wrapping simplification

Since the tail pointer in aio_ring structure never wrap ring size more than
once, so a simple compare is sufficient to wrap the index around. This avoid
a more expensive mod operation.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 212079cf 01-May-2005 Kenneth W Chen <kenneth.w.chen@intel.com>

[PATCH] aio: remove superfluous kiocb member initialization

This patch removes superfluous kiocb member initialization in the AIO
allocation and deallocation path. For example, in really_put_req(),
right before kiocb is returned to slab, 5 variables are reset to NULL.
The same variables will be initialized at the kiocb allocation time,
so why bother reset them knowing that they will be set to valid data
at alloc time? Another example: ki_retry is initialized in __aio_get_req,
but is initialized again in io_submit_one.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 25ee7e38 25-Apr-2005 Adrian Bunk <bunk@stusta.de>

[PATCH] fs/aio.c: make some code static

This patch makes some needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!