#
caeb4b0a |
|
31-Mar-2024 |
Kent Overstreet <kent.overstreet@linux.dev> |
aio: Fix null ptr deref in aio_complete() wakeup list_del_init_careful() needs to be the last access to the wait queue entry - it effectively unlocks access. Previously, finish_wait() would see the empty list head and skip taking the lock, and then we'd return - but the completion path would still attempt to do the wakeup after the task_struct pointer had been overwritten. Fixes: 71eb6b6b0ba9 ("fs/aio: obey min_nr when doing wakeups") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-fsdevel/CAHTA-ubfwwB51A5Wg5M6H_rPEQK9pNf8FkAGH=vr=FEkyRrtqw@mail.gmail.com/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/stable/20240331215212.522544-1-kent.overstreet%40linux.dev Link: https://lore.kernel.org/r/20240331215212.522544-1-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
|
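A minimal sketch of the ordering rule this fix describes (illustrative, not the verbatim fs/aio.c hunk; field names follow the commit text): list_del_init_careful() lets the waiter's finish_wait() return without taking the lock, after which the on-stack wait entry and its task_struct pointer may already be gone, so the wakeup has to come first.

```c
spin_lock_irqsave(&ctx->wait.lock, flags);
list_for_each_entry_safe(curr, next, &ctx->wait.head, w.entry)
	if (avail >= curr->min_nr) {
		/* Wake first: list_del_init_careful() effectively
		 * "unlocks" the entry and must be the last access. */
		wake_up_process(curr->w.private);
		list_del_init_careful(&curr->w.entry);
	}
spin_unlock_irqrestore(&ctx->wait.lock, flags);
```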
#
961ebd12 |
|
04-Mar-2024 |
Bart Van Assche <bvanassche@acm.org> |
fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion The first kiocb_set_cancel_fn() argument may point at a struct kiocb that is not embedded inside struct aio_kiocb. With the current code, depending on the compiler, the req->ki_ctx read happens either before the IOCB_AIO_RW test or after that test. Move the req->ki_ctx read such that it is guaranteed that the IOCB_AIO_RW test happens first. Reported-by: Eric Biggers <ebiggers@kernel.org> Cc: Benjamin LaHaise <ben@communityfibre.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Fixes: b820de741ae4 ("fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240304235715.3790858-1-bvanassche@acm.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
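A hedged sketch of the ordering this change enforces in kiocb_set_cancel_fn() (reconstructed from the description, not copied from the tree): the IOCB_AIO_RW test must run before the kiocb is treated as embedded in a struct aio_kiocb.

```c
void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
{
	struct aio_kiocb *req;

	/* A kiocb not submitted as a libaio read/write is not embedded
	 * in a struct aio_kiocb, so bail out before any conversion. */
	if (!(iocb->ki_flags & IOCB_AIO_RW))
		return;

	/* Only now are container_of() and the req->ki_ctx read safe. */
	req = container_of(iocb, struct aio_kiocb, rw);
	/* ... lock req->ki_ctx and install the cancel callback ... */
}
```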
#
28468cbe |
|
04-Mar-2024 |
Bart Van Assche <bvanassche@acm.org> |
Revert "fs/aio: Make io_cancel() generate completions again" Patch "fs/aio: Make io_cancel() generate completions again" is based on the assumption that calling kiocb->ki_cancel() does not complete R/W requests. This is incorrect: the two drivers that call kiocb_set_cancel_fn() callers set a cancellation function that calls usb_ep_dequeue(). According to its documentation, usb_ep_dequeue() calls the completion routine with status -ECONNRESET. Hence this revert. Cc: Benjamin LaHaise <ben@communityfibre.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Reported-by: syzbot+b91eb2ed18f599dd3c31@syzkaller.appspotmail.com Fixes: 54cbc058d86b ("fs/aio: Make io_cancel() generate completions again") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240304182945.3646109-1-bvanassche@acm.org Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
54cbc058 |
|
15-Feb-2024 |
Bart Van Assche <bvanassche@acm.org> |
fs/aio: Make io_cancel() generate completions again The following patch accidentally removed the code for delivering completions for cancelled reads and writes to user space: "[PATCH 04/33] aio: remove retry-based AIO" (https://lore.kernel.org/all/1363883754-27966-5-git-send-email-koverstreet@google.com/) From that patch: - if (kiocbIsCancelled(iocb)) { - ret = -EINTR; - aio_complete(iocb, ret, 0); - /* must not access the iocb after this */ - goto out; - } This leads to a leak in user space of a struct iocb. Hence this patch that restores the code that reports to user space that a read or write has been cancelled successfully. Fixes: 41003a7bcfed ("aio: remove retry-based AIO") Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-3-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
b820de74 |
|
15-Feb-2024 |
Bart Van Assche <bvanassche@acm.org> |
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
9d5b9475 |
|
20-Nov-2023 |
Joel Granados <j.granados@samsung.com> |
fs: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels), which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information: https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/). Remove the sentinel elements from the ctl_table arrays. Special attention was paid to making sure that an empty directory for fs/verity was created when CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the register sysctl call that expects a size. Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
#
9fd7874c |
|
04-Dec-2023 |
Jens Axboe <axboe@kernel.dk> |
iov_iter: replace import_single_range() with import_ubuf() With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
6ac805d1 |
|
04-Dec-2023 |
Jens Axboe <axboe@kernel.dk> |
iov_iter: remove unused 'iov' argument from import_single_range() It is entirely unused, just get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
|
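Taken together, the two iov_iter commits above amount to the following call-site conversion (illustrative; aio's single-buffer setup path is one such caller, and `rw`, `buf`, `len` are placeholders):

```c
struct iovec iov;		/* only needed by the oldest form */
struct iov_iter iter;

/* originally */
ret = import_single_range(rw, buf, len, &iov, &iter);

/* after dropping the unused 'iov' argument */
ret = import_single_range(rw, buf, len, &iter);

/* finally, import_single_range() goes away in favour of */
ret = import_ubuf(rw, buf, len, &iter);
```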
#
71eb6b6b |
|
22-Nov-2023 |
Kent Overstreet <kent.overstreet@linux.dev> |
fs/aio: obey min_nr when doing wakeups I've been observing workloads where IPIs due to wakeups in aio_complete() are ~15% of total CPU time in the profile. Most of those wakeups are unnecessary when completion batching is in use in io_getevents(). This plumbs min_nr through via the wait entry, so that aio_complete() can avoid doing unnecessary wakeups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20231122234257.179390-1-kent.overstreet@linux.dev Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Christian Brauner <brauner@kernel.org> Cc: <linux-aio@kvack.org> Cc: <linux-fsdevel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
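A sketch of the shape this plumbing takes, assuming a wait-queue entry extended with the requested batch size (names follow the commit description and may not match the tree exactly):

```c
/* Each io_getevents() waiter parks one of these on ctx->wait,
 * recording how many completions it actually needs. */
struct aio_waiter {
	struct wait_queue_entry	w;
	size_t			min_nr;
};

/* aio_complete() can then skip the wakeup (and the IPI it would
 * trigger) for waiters whose min_nr is not yet satisfied. */
```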
#
3652117f |
|
22-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
eventfd: simplify eventfd_signal() Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
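At call sites such as aio's completion path, the simplification reduces to dropping the always-1 counter argument (illustrative):

```c
/* before */
eventfd_signal(iocb->ki_eventfd, 1);

/* after */
eventfd_signal(iocb->ki_eventfd);
```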
#
600f111e |
|
17-Nov-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
fs: Rename mapping private members It is hard to find where mapping->private_lock, mapping->private_list and mapping->private_data are used, due to private_XXX being a relatively common name for variables and structure members in the kernel. To fit with other members of struct address_space, rename them all to have an i_ prefix. Tested with an allmodconfig build. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
db7fcc88 |
|
15-Sep-2023 |
Kees Cook <keescook@chromium.org> |
aio: Annotate struct kioctx_table with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct kioctx_table. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-aio@kvack.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Message-Id: <20230915201413.never.881-kees@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
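The annotation itself is a one-attribute change on the flexible array; a sketch of struct kioctx_table with __counted_by tying the array bound to the nr member (layout as described, details may differ):

```c
struct kioctx_table {
	struct rcu_head		rcu;
	unsigned		nr;
	struct kioctx __rcu	*table[] __counted_by(nr);
};
```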
#
592b5fad |
|
12-Jun-2023 |
Yu-cheng Yu <yu-cheng.yu@intel.com> |
mm: Re-introduce vm_flags to do_mmap() There was no more caller passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by: commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()"). There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap(). Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com
|
#
8c3cfa80 |
|
17-Aug-2023 |
Amir Goldstein <amir73il@gmail.com> |
aio: use kiocb_{start,end}_write() helpers Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-6-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
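A hedged sketch of the helper pairing this commit switches to, in place of the open-coded sb_start_write()/__sb_writers_release() dance (simplified; not the exact aio_write()/completion hunks):

```c
/* submission side: take freeze protection before issuing the write */
if (S_ISREG(file_inode(file)->i_mode))
	kiocb_start_write(&req->rw);
ret = call_write_iter(file, &req->rw, &iter);

/* completion side: the matching helper drops the freeze protection */
kiocb_end_write(&req->rw);
```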
#
5c075c5b |
|
09-Jun-2023 |
Fabio M. De Francesco <fmdefrancesco@gmail.com> |
fs/aio: Stop allocating aio rings from HIGHMEM There is no need to allocate aio rings from HIGHMEM because very little memory is needed here. Therefore, use the GFP_USER flag in find_or_create_page() and get rid of kmap*() mappings. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ira Weiny <ira.weiny@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Message-Id: <20230609145937.17610-1-fmdefrancesco@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
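The effect of the change, sketched: ring pages come from lowmem, so the ring can be addressed with page_address() and the kmap*() calls disappear (illustrative, assuming the usual ring-setup variables):

```c
/* lowmem allocation instead of a highmem-capable mask */
page = find_or_create_page(file->f_mapping, i, GFP_USER | __GFP_ZERO);

/* lowmem pages are always mapped: no kmap_atomic()/kunmap_atomic() */
ring = page_address(ctx->ring_pages[0]);
ring->nr = nr_events;
```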
#
1c71222e |
|
26-Jan-2023 |
Suren Baghdasaryan <surenb@google.com> |
mm: replace vma->vm_flags direct modifications with modifier calls Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
81e9d6f8 |
|
30-Jan-2023 |
Seth Jenkins <sethjenkins@google.com> |
aio: fix mremap after fork null-deref Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced a null-deref if mremap is called on an old aio mapping after fork as mm->ioctx_table will be set to NULL. [jmoyer@redhat.com: fix 80 column issue] Link: https://lkml.kernel.org/r/x49sffq4nvg.fsf@segfault.boston.devel.redhat.com Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring") Signed-off-by: Seth Jenkins <sethjenkins@google.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Jann Horn <jannh@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
de4eda9d |
|
15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
38ace0d5 |
|
14-Jul-2022 |
Uros Bizjak <ubizjak@gmail.com> |
aio: use atomic_try_cmpxchg in __get_reqs_available Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old in __get_reqs_available. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Link: https://lkml.kernel.org/r/20220714164851.3055-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
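The generic shape of that conversion (a sketch, not the verbatim __get_reqs_available()): atomic_try_cmpxchg() returns a bool and, on failure, writes the value it observed back into 'old', so the explicit compare and the re-read disappear.

```c
/* Decrement an atomic counter only while it stays positive. */
static bool dec_if_positive(atomic_t *v)
{
	int old = atomic_read(v);

	do {
		if (old <= 0)
			return false;
		/* on failure, 'old' is refreshed with the current value */
	} while (!atomic_try_cmpxchg(v, &old, old - 1));

	return true;
}
```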
#
3648951c |
|
06-Jun-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
aio: Convert to migrate_folio Use a folio throughout this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
164f4064 |
|
22-May-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
keep iocb_flags() result cached in struct file * calculate at the time we set FMODE_OPENED (do_dentry_open() for normal opens, alloc_file() for pipe()/socket()/etc.) * update when handling F_SETFL * keep in a new field - file->f_iocb_flags; since that thing is needed only before the refcount reaches zero, we can put it into the same anon union where ->f_rcuhead and ->f_llist live - those are used only after refcount reaches zero. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
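At submission sites such as aio's request setup this caching shows up as a plain field read instead of recomputing iocb_flags() for every request (illustrative):

```c
/* before: recomputed from f_flags/f_mode on every submission */
req->ki_flags = iocb_flags(req->ki_filp);

/* after: computed at open time (and on F_SETFL) and cached in struct file */
req->ki_flags = req->ki_filp->f_iocb_flags;
```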
#
61e02cdb |
|
14-Mar-2022 |
Lukas Bulwahn <lukas.bulwahn@gmail.com> |
aio: drop needless assignment in aio_read() Commit 84c4e1f89fef ("aio: simplify - and fix - fget/fput for io_submit()") refactored aio_read() and some error cases into early return, which made some intermediate assignment of the return variable needless. Drop this needless assignment in aio_read(). No functional change. No change in resulting object code. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
41d36a9f |
|
07-Mar-2022 |
Christoph Hellwig <hch@lst.de> |
fs: remove kiocb.ki_hint This field is entirely unused now except for a tracepoint in f2fs, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
46de8b97 |
|
09-Feb-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio This is a mechanical change. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
|
#
86b12b6c |
|
21-Jan-2022 |
Xiaoming Ni <nixiaoming@huawei.com> |
aio: move aio sysctl to aio.c The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. Move aio sysctl to aio.c and use the new register_sysctl_init() to register the sysctl interface for aio. [mcgrof@kernel.org: adjust commit log to justify the move] Link: https://lkml.kernel.org/r/20211123202347.818157-9-mcgrof@kernel.org Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Qing Wang <wangqing@vivo.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Antti Palosaari <crope@iki.fi> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Mark Fasheh <mark@fasheh.com> Cc: Phillip Potter <phil@philpotter.co.uk> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: James E.J. Bottomley <jejb@linux.ibm.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
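A sketch of the registration pattern this move uses; the two fs.aio-* knobs and register_sysctl_init() come from the commit description, and the exact table contents may differ:

```c
static struct ctl_table aio_sysctls[] = {
	{
		.procname	= "aio-nr",
		.data		= &aio_nr,
		.maxlen		= sizeof(aio_nr),
		.mode		= 0444,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "aio-max-nr",
		.data		= &aio_max_nr,
		.maxlen		= sizeof(aio_max_nr),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{}	/* sentinel; dropped later by the ctl_table cleanup above */
};

static void __init aio_sysctl_init(void)
{
	register_sysctl_init("fs", aio_sysctls);
}
```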
#
4b374986 |
|
13-Sep-2021 |
Xie Yongji <xieyongji@bytedance.com> |
aio: Fix incorrect usage of eventfd_signal_allowed() We should defer eventfd_signal() to the workqueue when eventfd_signal_allowed() return false rather than return true. Fixes: b542e383d8c0 ("eventfd: Make signal recursion protection a task bit") Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
|
#
50252e4b |
|
08-Dec-2021 |
Eric Biggers <ebiggers@google.com> |
aio: fix use-after-free due to missing POLLFREE handling signalfd_poll() and binder_poll() are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, by sending a POLLFREE notification to all waiters. Unfortunately, only eventpoll handles POLLFREE. A second type of non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't handle POLLFREE. This allows a use-after-free to occur if a signalfd or binder fd is polled with aio poll, and the waitqueue gets freed. Fix this by making aio poll handle POLLFREE. A patch by Ramji Jiyani <ramjiyani@google.com> (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com) tried to do this by making aio_poll_wake() always complete the request inline if POLLFREE is seen. However, that solution had two bugs. First, it introduced a deadlock, as it unconditionally locked the aio context while holding the waitqueue lock, which inverts the normal locking order. Second, it didn't consider that POLLFREE notifications are missed while the request has been temporarily de-queued. The second problem was solved by my previous patch. This patch then properly fixes the use-after-free by handling POLLFREE in a deadlock-free way. It does this by taking advantage of the fact that freeing of the waitqueue is RCU-delayed, similar to what eventpoll does. Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.18+ Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
|
#
363bee27 |
|
08-Dec-2021 |
Eric Biggers <ebiggers@google.com> |
aio: keep poll requests on waitqueue until completed Currently, aio_poll_wake() will always remove the poll request from the waitqueue. Then, if aio_poll_complete_work() sees that none of the polled events are ready and the request isn't cancelled, it re-adds the request to the waitqueue. (This can easily happen when polling a file that doesn't pass an event mask when waking up its waitqueue.) This is fundamentally broken for two reasons: 1. If a wakeup occurs between vfs_poll() and the request being re-added to the waitqueue, it will be missed because the request wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL might never complete even if the polled file is ready. 2. When the request isn't on the waitqueue, there is no way to be notified that the waitqueue is being freed (which happens when its lifetime is shorter than the struct file's). This is supposed to happen via the waitqueue entries being woken up with POLLFREE. Therefore, leave the requests on the waitqueue until they are actually completed (or cancelled). To keep track of when aio_poll_complete_work needs to be scheduled, use new fields in struct poll_iocb. Remove the 'done' field which is now redundant. Note that this is consistent with how sys_poll() and eventpoll work; their wakeup functions do *not* remove the waitqueue entries. Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.18+ Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
|
#
6b19b766 |
|
21-Oct-2021 |
Jens Axboe <axboe@kernel.dk> |
fs: get rid of the res2 iocb->ki_complete argument The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
6446c4fb |
|
19-Sep-2021 |
Len Baker <len.baker@gmx.com> |
aio: Prefer struct_size over open coded arithmetic As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. So, use the struct_size() helper to do the arithmetic instead of the argument "size + size * count" in the kzalloc() function. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
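The allocator-argument change boils down to the following, sketched with the kioctx_table allocation since that is the open-coded multiplication in fs/aio.c:

```c
/* before: open-coded "header + count * element", which can overflow */
table = kzalloc(sizeof(*table) + sizeof(table->table[0]) * nr, GFP_KERNEL);

/* after: struct_size() performs the same arithmetic with overflow checks */
table = kzalloc(struct_size(table, table, nr), GFP_KERNEL);
```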
#
b542e383 |
|
29-Jul-2021 |
Thomas Gleixner <tglx@linutronix.de> |
eventfd: Make signal recursion protection a task bit The recursion protection for eventfd_signal() is based on a per CPU variable and relies on the !RT semantics of spin_lock_irqsave() for protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither disables preemption nor interrupts which allows the spin lock held section to be preempted. If the preempting task invokes eventfd_signal() as well, then the recursion warning triggers. Paolo suggested to protect the per CPU variable with a local lock, but that's heavyweight and actually not necessary. The goal of this protection is to prevent the task stack from overflowing, which can be achieved with a per task recursion protection as well. Replace the per CPU variable with a per task bit similar to other recursion protection bits like task_struct::in_page_owner. This works on both !RT and RT kernels and removes as a side effect the extra per CPU storage. No functional change for !RT kernels. Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
|
#
14d07113 |
|
29-Apr-2021 |
Brian Geffon <bgeffon@google.com> |
Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99. As discussed in [1] this commit was a no-op because the mapping type was checked in vma_to_resize before move_vma is ever called. This meant that vm_ops->mremap() would never be called on such mappings. Furthermore, we've since expanded support of MREMAP_DONTUNMAP to non-anonymous mappings, and these special mappings are still protected by the existing check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL. 1. https://lkml.org/lkml/2020/12/28/2340 Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.com Signed-off-by: Brian Geffon <bgeffon@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Alejandro Colomar <alx.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cd544fd1 |
|
14-Dec-2020 |
Dmitry Safonov <0x7f454c46@gmail.com> |
mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio As kernel expect to see only one of such mappings, any further operations on the VMA-copy may be unexpected by the kernel. Maybe it's being on the safe side, but there doesn't seem to be any expected use-case for this, so restrict it now. Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8a3c84b6 |
|
10-Nov-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: separate __sb_start_write into blocking and non-blocking helpers Break this function into two helpers so that it's obvious that the trylock versions return a value that must be checked, and the blocking versions don't require that. While we're at it, clean up the return type mismatch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
e8f147dc |
|
03-Nov-2020 |
Thomas Gleixner <tglx@linutronix.de> |
fs: Remove asm/kmap_types.h includes Historical leftovers from the time where kmap() had fixed slots. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Sterba <dsterba@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20201103095856.870272797@linutronix.de
|
#
89cd35c5 |
|
24-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
iov_iter: transparently handle compat iovecs in import_iovec Use it in compat syscalls to import either native or the compat iovecs, and remove the now superfluous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
df561f66 |
|
23-Aug-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
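An illustrative before/after of the treewide substitution (hypothetical switch; the opcode names are aio's, but this is not a verbatim fs/aio.c hunk and do_read() is a placeholder):

```c
switch (opcode) {
case IOCB_CMD_PREADV:
	vectored = true;
	fallthrough;		/* replaces the old comment-style marking */
case IOCB_CMD_PREAD:
	ret = do_read(req, vectored);	/* do_read() is hypothetical */
	break;
default:
	ret = -EINVAL;
}
```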
#
45e55300 |
|
07-Aug-2020 |
Peter Collingbourne <pcc@google.com> |
mm: remove unnecessary wrapper function do_mmap_pgoff() The current split between do_mmap() and do_mmap_pgoff() was introduced in commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()") to support MPX. The wrapper function do_mmap_pgoff() always passed 0 as the value of the vm_flags argument to do_mmap(). However, MPX support has subsequently been removed from the kernel and there were no more direct callers of do_mmap(); all calls were going via do_mmap_pgoff(). Simplify the code by removing do_mmap_pgoff() and changing all callers to directly call do_mmap(), which now no longer takes a vm_flags argument. Signed-off-by: Peter Collingbourne <pcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
241cb28e |
|
28-May-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
aio: Replace zero-length array with flexible-array There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
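The zero-length-to-flexible-array change is a one-character diff per declaration; sketched again with kioctx_table (compare the later __counted_by annotation above):

```c
/* before: GNU zero-length array */
struct kioctx __rcu	*table[0];

/* after: C99 flexible array member */
struct kioctx __rcu	*table[];
```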
#
9bf5b9eb |
|
10-Jun-2020 |
Christoph Hellwig <hch@lst.de> |
kernel: move use_mm/unuse_mm to kthread.c Patch series "improve use_mm / unuse_mm", v2. This series improves the use_mm / unuse_mm interface by better documenting the assumptions, and by taking the set_fs manipulations spread over the callers into the core API. This patch (of 3): Use the proper API instead. Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de These helpers are only for use with kernel threads, and I will tie them more into the kthread infrastructure going forward. Also move the prototypes to kthread.h - mmu_context.h was a little weird to start with as it otherwise contains very low-level MM bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d8ed45c5 |
|
08-Jun-2020 |
Michel Lespinasse <walken@google.com> |
mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
530f32fc |
|
14-May-2020 |
Miklos Szeredi <mszeredi@redhat.com> |
aio: fix async fsync creds Avi Kivity reports that on fuse filesystems running in a user namespace asynchronous fsync fails with EOVERFLOW. The reason is that f_ops->fsync() is called with the creds of the kthread performing aio work instead of the creds of the process originally submitting IOCB_CMD_FSYNC. Fuse sends the creds of the caller in the request header and it needs to translate the uid and gid into the server's user namespace. Since the kthread is running in init_user_ns, the translation will fail and the operation returns an error. It can be argued that fsync doesn't actually need any creds, but just zeroing out those fields in the header (as with requests that currently don't take creds) is a backward compatibility risk. Instead of working around this issue in fuse, solve the core of the problem by calling the filesystem with the proper creds. Reported-by: Avi Kivity <avi@scylladb.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Fixes: c9582eb0ff7d ("fuse: Fail all requests with invalid uids or gids") Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
01d7a356 |
|
03-Feb-2020 |
Jens Axboe <axboe@kernel.dk> |
aio: prevent potential eventfd recursion on poll If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 4.19+ Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
3ca47e95 |
|
23-Apr-2019 |
Arnd Bergmann <arnd@arndb.de> |
y2038: remove CONFIG_64BIT_TIME The CONFIG_64BIT_TIME option is defined on all architectures, and can be removed for simplicity now. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
#
97eba80f |
|
20-Aug-2019 |
Guillem Jover <guillem@hadrons.org> |
aio: Fix io_pgetevents() struct __compat_aio_sigset layout This type is used to pass the sigset_t from userland to the kernel, but it was using the kernel native pointer type for the member representing the compat userland pointer to the userland sigset_t. This messes up the layout, and makes the kernel eat up both the userland pointer and the size members into the kernel pointer, and then reads garbage into the kernel sigsetsize. Which makes the sigset_t size consistency check fail, and consequently the syscall always returns -EINVAL. This breaks both libaio and strace on 32-bit userland running on 64-bit kernels. And there are apparently no users in the wild of the current broken layout (at least according to codesearch.debian.org and a brief check over github.com search). So it looks safe to fix this directly in the kernel, instead of either letting userland deal with this permanently with the additional overhead or trying to make the syscall infer what layout userland used, even though this is also being worked around in libaio to temporarily cope with kernels that have not yet been fixed. We use a proper compat_uptr_t instead of a compat_sigset_t pointer. Fixes: 7a074e96dee6 ("aio: implement io_pgetevents") Signed-off-by: Guillem Jover <guillem@hadrons.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
37109694 |
|
18-Jul-2019 |
Keith Busch <kbusch@kernel.org> |
mm: migrate: remove unused mode argument migrate_page_move_mapping() doesn't use the mode argument. Remove it and update callers accordingly. Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b772434b |
|
16-Jul-2019 |
Oleg Nesterov <oleg@redhat.com> |
signal: simplify set_user_sigmask/restore_user_sigmask task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified. This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask(). Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
97abc889 |
|
28-Jun-2019 |
Oleg Nesterov <oleg@redhat.com> |
signal: remove the wrong signal_pending() check in restore_user_sigmask() This is the minimal fix for stable, I'll send cleanups later. Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Eric Wong <e@80x24.org> Tested-by: Eric Wong <e@80x24.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
87e5e6da |
|
14-May-2019 |
Jens Axboe <axboe@kernel.dk> |
uio: make import_iovec()/compat_import_iovec() return bytes on success Currently these functions return < 0 on error, and 0 for success. Change that so that we return < 0 on error, but number of bytes for success. Some callers already treat the return value that way, others need a slight tweak. Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
52db59df |
|
25-Mar-2019 |
David Howells <dhowells@redhat.com> |
vfs: Convert aio to use the new mount API Convert the aio filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Signed-off-by: David Howells <dhowells@redhat.com> cc: Benjamin LaHaise <bcrl@kvack.org> cc: linux-aio@kvack.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1f58bb18 |
|
20-May-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
mount_pseudo(): drop 'name' argument, switch to d_make_root() Once upon a time we used to set ->d_name of e.g. pipefs root so that d_path() on pipes would work. These days it's completely pointless - dentries of pipes are not even connected to pipefs root. However, mount_pseudo() had set the root dentry name (passed as the second argument) and callers kept inventing names to pass to it. Including those that didn't *have* any non-root dentries to start with... All of that had been pointless for about 8 years now; it's time to get rid of that cargo-culting... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6af1c849 |
|
04-Apr-2019 |
Wei Yongjun <weiyongjun1@huawei.com> |
aio: use kmem_cache_free() instead of kfree() memory allocated by kmem_cache_alloc() should be freed using kmem_cache_free(), not kfree(). Fixes: fa0ca2aee3be ("deal with get_reqs_available() in aio_get_req() itself") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
18bfb9c6 |
|
03-Apr-2019 |
Dan Carpenter <dan.carpenter@oracle.com> |
aio: Fix an error code in __io_submit_one() This accidentally returns the wrong variable. The "req->ki_eventfd" pointer is NULL so this returns success. Fixes: 7316b49c2a11 ("aio: move sanity checks and request allocation to io_submit_one()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7316b49c |
|
06-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: move sanity checks and request allocation to io_submit_one() makes for somewhat cleaner control flow in __io_submit_one() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fa0ca2ae |
|
06-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
deal with get_reqs_available() in aio_get_req() itself simplifies the caller Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
74259703 |
|
06-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: move dropping ->ki_eventfd into iocb_destroy() no reason to duplicate that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
958c13ce |
|
06-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
make aio_read()/aio_write() return int. That ssize_t is a rudiment of earlier calling conventions; it's been used only to pass 0 and -E... since last autumn. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
af5c72b1 |
|
07-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
Fix aio_poll() races aio_poll() has to cope with several unpleasant problems: * requests that might stay around indefinitely need to be made visible for io_cancel(2); that must not be done to a request already completed, though. * in cases when ->poll() has placed us on a waitqueue, wakeup might have happened (and request completed) before ->poll() returns. * worse, in some early wakeup cases request might end up re-added into the queue later - we can't treat "woken up and currently not in the queue" as "it's not going to stick around indefinitely" * ... moreover, ->poll() might have decided not to put it on any queues to start with, and that needs to be distinguished from the previous case * ->poll() might have tried to put us on more than one queue. Only the first will succeed for aio poll, so we might end up missing wakeups. OTOH, we might very well notice that only after the wakeup hits and request gets completed (all before ->poll() gets around to the second poll_wait()). In that case it's too late to decide that we have an error. req->woken was an attempt to deal with that. Unfortunately, it was broken. What we need to keep track of is not that wakeup has happened - the thing might come back after that. It's that async reference is already gone and won't come back, so we can't (and needn't) put the request on the list of cancellables. The easiest case is "request hadn't been put on any waitqueues"; we can tell by seeing NULL apt.head, and in that case there won't be anything async. We should either complete the request ourselves (if vfs_poll() reports anything of interest) or return an error. In all other cases we get exclusion with wakeups by grabbing the queue lock. If request is currently on queue and we have something interesting from vfs_poll(), we can steal it and complete the request ourselves. If it's on queue and vfs_poll() has not reported anything interesting, we either put it on the cancellable list, or, if we know that it hadn't been put on all queues ->poll() wanted it on, we steal it and return an error. If it's _not_ on queue, it's either been already dealt with (in which case we do nothing), or there's aio_poll_complete_work() about to be executed. In that case we either put it on the cancellable list, or, if we know it hadn't been put on all queues ->poll() wanted it on, simulate what cancel would've done. It's a lot more convoluted than I'd like it to be. Single-consumer APIs suck, and unfortunately aio is not an exception... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2bb874c0 |
|
07-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: store event at final iocb_put() Instead of having aio_complete() set ->ki_res.{res,res2}, do that explicitly in its callers, drop the reference (as aio_complete() used to do) and delay the rest until the final iocb_put(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a9339b78 |
|
07-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: keep io_event in aio_kiocb We want to separate forming the resulting io_event from putting it into the ring buffer. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
833f4154 |
|
11-Mar-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: fold lookup_kiocb() into its sole caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b53119f1 |
|
06-Mar-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pin iocb through aio. aio_poll() is not the only case that needs file pinned; worse, while aio_read()/aio_write() can live without pinning iocb itself, the proof is rather brittle and can easily break on later changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
84c4e1f8 |
|
03-Mar-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
aio: simplify - and fix - fget/fput for io_submit() Al Viro root-caused a race where the IOCB_CMD_POLL handling of fget/fput() could cause us to access the file pointer after it had already been freed: "In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method: Al also wrote a patch to take an extra reference to the file descriptor to fix this, but I instead suggested we just streamline the whole file pointer handling by submit_io() so that the generic aio submission code simply keeps the file pointer around until the aio has completed. Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL") Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d3d6a18d |
|
08-Feb-2019 |
Bart Van Assche <bvanassche@acm.org> |
aio: Fix locking in aio_poll() wake_up_locked() may but does not have to be called with interrupts disabled. Since the fuse filesystem calls wake_up_locked() without disabling interrupts aio_poll_wake() may be called with interrupts enabled. Since the kioctx.ctx_lock may be acquired from IRQ context, all code that acquires that lock from thread context must disable interrupts. Hence change the spin_trylock() call in aio_poll_wake() into a spin_trylock_irqsave() call. This patch fixes the following lockdep complaint: ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 5.0.0-rc4-next-20190131 #23 Not tainted ----------------------------------------------------- syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: 0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline] 0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 and this task is already holding: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline] 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908 which would create a new lock dependency: (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.} but this new dependency connects a SOFTIRQ-irq-safe lock: (&(&ctx->ctx_lock)->rlock){..-.} ... which became SOFTIRQ-irq-safe at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] free_ioctx_users+0x2d/0x4a0 fs/aio.c:610 percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline] percpu_ref_put include/linux/percpu-refcount.h:301 [inline] percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline] percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2486 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline] rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780 __do_softirq+0x266/0x95a kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:654 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646 smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164 kthread+0x357/0x430 kernel/kthread.c:247 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 to a SOFTIRQ-irq-unsafe lock: (&fiq->waitq){+.+.} ... which became SOFTIRQ-irq-unsafe at: ... 
lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fiq->waitq); local_irq_disable(); lock(&(&ctx->ctx_lock)->rlock); lock(&fiq->waitq); <Interrupt> lock(&(&ctx->ctx_lock)->rlock); *** DEADLOCK *** 1 lock held by syz-executor2/13779: #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline] #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908 the dependencies between SOFTIRQ-irq-safe lock and the holding lock: -> (&(&ctx->ctx_lock)->rlock){..-.} { IN-SOFTIRQ-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] free_ioctx_users+0x2d/0x4a0 fs/aio.c:610 percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline] percpu_ref_put include/linux/percpu-refcount.h:301 [inline] percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline] percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2486 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline] rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780 __do_softirq+0x266/0x95a kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:654 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646 smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164 kthread+0x357/0x430 kernel/kthread.c:247 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 INITIAL USE at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160 spin_lock_irq include/linux/spinlock.h:354 [inline] __do_sys_io_cancel fs/aio.c:2052 [inline] __se_sys_io_cancel fs/aio.c:2035 [inline] __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe } ... key at: [<ffffffff8a574140>] __key.52370+0x0/0x40 ... 
acquired at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: -> (&fiq->waitq){+.+.} { HARDIRQ-ON-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe SOFTIRQ-ON-W at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] __se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe INITIAL USE at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415 fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676 fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687 fuse_send_init fs/fuse/inode.c:989 [inline] fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214 mount_nodev+0x68/0x110 fs/super.c:1392 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239 legacy_get_tree+0xf2/0x200 fs/fs_context.c:590 vfs_get_tree+0x123/0x450 fs/super.c:1481 do_new_mount fs/namespace.c:2610 [inline] do_mount+0x1436/0x2c40 fs/namespace.c:2932 ksys_mount+0xdb/0x150 fs/namespace.c:3148 __do_sys_mount fs/namespace.c:3162 [inline] 
__se_sys_mount fs/namespace.c:3159 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe } ... key at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40 ... acquired at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe stack backtrace: CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline] check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605 check_irq_usage kernel/locking/lockdep.c:1650 [inline] check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline] check_prev_add kernel/locking/lockdep.c:1860 [inline] check_prevs_add kernel/locking/lockdep.c:1968 [inline] validate_chain kernel/locking/lockdep.c:2339 [inline] __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144 spin_lock include/linux/spinlock.h:329 [inline] aio_poll fs/aio.c:1772 [inline] __io_submit_one fs/aio.c:1875 [inline] io_submit_one+0xedf/0x1cf0 fs/aio.c:1908 __do_sys_io_submit fs/aio.c:1953 [inline] __se_sys_io_submit fs/aio.c:1923 [inline] __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19 Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> [ bvanassche: added a comment ] Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8dabe724 |
|
06-Jan-2019 |
Arnd Bergmann <arnd@arndb.de> |
y2038: syscalls: rename y2038 compat syscalls A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
#
ec51f8ee |
|
05-Feb-2019 |
Mike Marshall <hubcap@omnibond.com> |
aio: initialize kiocb private in case any filesystems expect it. A recent optimization had left private uninitialized. Fixes: 2bc4ca9bb600 ("aio: don't zero entire aio_kiocb aio_get_req()") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
ab41ee68 |
|
28-Dec-2018 |
Jan Kara <jack@suse.cz> |
mm: migrate: drop unused argument of migrate_page_move_mapping() All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
875736bb |
|
20-Nov-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: abstract out io_event filler helper Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
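
For illustration, a sketch of what such a filler helper could look like, based on the io_event layout in the aio uapi; the helper name and the aio_kiocb field names here are assumptions, not a quote of the patch:

  static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
                             long res, long res2)
  {
          /* hand the user's iocb pointer and cookie back, plus both results */
          ev->obj  = (u64)(unsigned long)iocb->ki_user_iocb;
          ev->data = iocb->ki_user_data;
          ev->res  = res;
          ev->res2 = res2;
  }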
|
#
88a6f18b |
|
24-Nov-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: split out iocb copy from io_submit_one() In preparation for handing in iocbs in a different fashion as well. Also make it clear that the iocb being passed in isn't modified, by marking it const throughout. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
71ebc6fe |
|
24-Nov-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: use iocb_put() instead of open coding it Replace the percpu_ref_put() + kmem_cache_free() with a call to iocb_put() instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
a79d40e9 |
|
04-Dec-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: only use blk plugs for > 2 depth submissions Plugging is meant to optimize submission of a string of IOs; if we don't have more than 2 being submitted, don't bother setting up a plug. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
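
A minimal sketch of the pattern being described, assuming nr is the iocb count passed to io_submit(); the submission call itself is shown schematically:

  struct blk_plug plug;
  bool plugged = nr > 2;          /* only worth batching for deeper submissions */

  if (plugged)
          blk_start_plug(&plug);
  for (i = 0; i < nr; i++)
          ret = io_submit_one(ctx, iocbpp[i], false);   /* schematic */
  if (plugged)
          blk_finish_plug(&plug);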
|
#
2bc4ca9b |
|
04-Dec-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: don't zero entire aio_kiocb aio_get_req() It's 192 bytes, fairly substantial. Most items don't need to be cleared, especially not upfront. Clear the ones we do need to clear, and leave the other ones for setup when the iocb is prepared and submitted. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
432c7997 |
|
19-Nov-2018 |
Christoph Hellwig <hch@lst.de> |
aio: separate out ring reservation from req allocation This is in preparation for certain types of IO not needing a ring reservation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
bc9bff61 |
|
06-Nov-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: use assigned completion handler We know this is a read/write request, but in preparation for having different kinds of those, ensure that we call the assigned handler instead of assuming it's aio_complete_rq(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0afa9964 |
|
10-Dec-2018 |
Jeff Moyer <jmoyer@redhat.com> |
aio: fix spectre gadget in lookup_ioctx Matthew pointed out that the ioctx_table is susceptible to spectre v1, because the index can be controlled by an attacker. The below patch should mitigate the attack for all of the aio system calls. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a538e3ff |
|
10-Dec-2018 |
Jeff Moyer <jmoyer@redhat.com> |
aio: fix spectre gadget in lookup_ioctx Matthew pointed out that the ioctx_table is susceptible to spectre v1, because the index can be controlled by an attacker. The below patch should mitigate the attack for all of the aio system calls. Cc: stable@vger.kernel.org Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
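
The usual mitigation for this kind of gadget is to clamp the attacker-controlled index with array_index_nospec() before it feeds the dependent load; a sketch, with the kioctx_table field names assumed from context:

  struct kioctx_table *table = rcu_dereference(mm->ioctx_table);

  if (!table || id >= table->nr)
          goto out;
  id = array_index_nospec(id, table->nr);   /* no speculation past the bound */
  ctx = rcu_dereference(table->table[id]);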
|
#
7a35397f |
|
19-Sep-2018 |
Deepa Dinamani <deepa.kernel@gmail.com> |
io_pgetevents: use __kernel_timespec struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update io_pgetevents interfaces to use struct __kernel_timespec. sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls: New y2038 safe syscalls: (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs) Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents Compat : compat_sys_io_pgetevents_time64 Older y2038 unsafe syscalls: (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs) Native 32 bit : sys_io_pgetevents_time32 Compat : compat_sys_io_pgetevents Note that io_getevents syscalls do not have a y2038 safe solution. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
#
854a6ed5 |
|
19-Sep-2018 |
Deepa Dinamani <deepa.kernel@gmail.com> |
signal: Add restore_user_sigmask() Refactor the logic to restore the sigmask before the syscall returns into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during the execution and restored after the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
#
ded653cc |
|
19-Sep-2018 |
Deepa Dinamani <deepa.kernel@gmail.com> |
signal: Add set_user_sigmask() Refactor reading sigset from userspace and updating sigmask into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during, and restored after, the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Note that the calls to sigprocmask() ignored the return value from the api as the function only returns an error on an invalid first argument that is hardcoded at these call sites. The updated logic uses set_current_blocked() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
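
Schematically, a syscall using this pair brackets its wait like the following; the exact argument lists of the helpers are assumptions here, the point is the swap-in/restore bracketing:

  sigset_t ksigmask, sigsaved;

  /* install the caller-supplied mask for the duration of the wait */
  ret = set_user_sigmask(usig, &ksigmask, &sigsaved, sigsetsize);
  if (ret)
          return ret;

  ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout);

  /* put the original mask back, deferring delivery as appropriate */
  restore_user_sigmask(usig, &sigsaved);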
|
#
154989e4 |
|
22-Nov-2018 |
Christoph Hellwig <hch@lst.de> |
aio: clear IOCB_HIPRI No one is going to poll for aio (yet), so we must clear the HIPRI flag, as we would otherwise send it down the poll queues, where no one will be polling for completions. Signed-off-by: Christoph Hellwig <hch@lst.de> IOCB_HIPRI, not RWF_HIPRI. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
76dc8913 |
|
19-Nov-2018 |
Damien Le Moal <damien.lemoal@wdc.com> |
aio: Fix fallback I/O priority value For cases when the application does not specify aio_reqprio for an aio, fallback to use get_current_ioprio() to obtain the task I/O priority last set using ioprio_set() rather than the hardcoded IOPRIO_CLASS_NONE value. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
53fffe29 |
|
17-Nov-2018 |
Jens Axboe <axboe@kernel.dk> |
aio: fix failure to put the file pointer If the ioprio capability check fails, we return without putting the file pointer. Fixes: d9a08a9e616b ("fs: Add aio iopriority support") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9afc5eee |
|
12-Jul-2018 |
Arnd Bergmann <arnd@arndb.de> |
y2038: globally rename compat_time to old_time32 Christoph Hellwig suggested a slightly different path for handling backwards compatibility with the 32-bit time_t based system calls: Rather than simply reusing the compat_sys_* entry points on 32-bit architectures unchanged, we get rid of those entry points and the compat_time types by renaming them to something that makes more sense on 32-bit architectures (which don't have a compat mode otherwise), and then share the entry points under the new name with the 64-bit architectures that use them for implementing the compatibility. The following types and interfaces are renamed here, and moved from linux/compat_time.h to linux/time32.h: old new --- --- compat_time_t old_time32_t struct compat_timeval struct old_timeval32 struct compat_timespec struct old_timespec32 struct compat_itimerspec struct old_itimerspec32 ns_to_compat_timeval() ns_to_old_timeval32() get_compat_itimerspec64() get_old_itimerspec32() put_compat_itimerspec64() put_old_itimerspec32() compat_get_timespec64() get_old_timespec32() compat_put_timespec64() put_old_timespec32() As we already have aliases in place, this patch addresses only the instances that are relevant to the system call interface in particular, not those that occur in device drivers and other modules. Those will get handled separately, while providing the 64-bit version of the respective interfaces. I'm not renaming the timex, rusage and itimerval structures, as we are still debating what the new interface will look like, and whether we will need a replacement at all. This also doesn't change the names of the syscall entry points, which can be done more easily when we actually switch over the 32-bit architectures to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix. Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
#
e8693bcf |
|
15-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
aio: allow direct aio poll completions for keyed wakeups If we get a keyed wakeup for an aio poll waitqueue and the wakeup can acquire the ctx_lock without spinning, we can just complete the iocb straight from the wakeup callback to avoid a context switch. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Avi Kivity <avi@scylladb.com>
|
#
bfe4037e |
|
16-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
aio: implement IOCB_CMD_POLL Simple one-shot poll through the io_submit() interface. To poll for a file descriptor the application should submit an iocb of type IOCB_CMD_POLL. It will poll the fd for the events specified in the first 32 bits of the aio_buf field of the iocb. Unlike poll or epoll without EPOLLONESHOT this interface always works in one-shot mode, that is, once the iocb is completed it will have to be resubmitted. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Avi Kivity <avi@scylladb.com>
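
A hedged userspace fragment showing how such a one-shot poll might be submitted via the raw syscalls (error handling omitted; field usage follows the description above):

  #include <linux/aio_abi.h>
  #include <sys/syscall.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  aio_context_t ctx = 0;
  syscall(__NR_io_setup, 128, &ctx);

  struct iocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_lio_opcode = IOCB_CMD_POLL;
  cb.aio_fildes     = fd;              /* descriptor to poll */
  cb.aio_buf        = POLLIN;          /* requested events, low 32 bits */

  struct iocb *cbs[1] = { &cb };
  syscall(__NR_io_submit, ctx, 1, cbs);
  /* the result arrives via io_getevents(); resubmit to wait for the next event */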
|
#
9018ccc4 |
|
24-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
aio: add a iocb refcount This is needed to prevent races caused by the way the ->poll API works. To avoid introducing overhead for other users of the iocbs we initialize it to zero and only do refcount operations if it is non-zero in the completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Avi Kivity <avi@scylladb.com>
|
#
9ba546c0 |
|
11-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
aio: don't expose __aio_sigset in uapi glibc uses a different definition of sigset_t than the kernel does, and the current version would pull in both. To fix this just do not expose the type at all - this somewhat mirrors pselect() where we do not even have a type for the magic sigmask argument, but just use pointer arithmetic. Fixes: 7a074e96 ("aio: implement io_pgetevents") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Adrian Reber <adrian@lisas.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d93aa9d8 |
|
09-Jun-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
new wrapper: alloc_file_pseudo() It takes an inode, a vfsmount, a name, O_... flags and file_operations, and either returns a new struct file (in which case the inode reference we held is consumed) or returns ERR_PTR(), in which case no refcounts are altered. Converted aio_private_file() and sock_alloc_file() to it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c9c554f2 |
|
11-Jul-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
alloc_file(): switch to passing O_... flags instead of FMODE_... mode ... so that it could set both ->f_flags and ->f_mode, without callers having to set ->f_flags manually. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a11e1d43 |
|
28-Jun-2018 |
Linus Torvalds <torvalds@linux-foundation.org> |
Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL The poll() changes were not well thought out, and completely unexplained. They also caused a huge performance regression, because "->poll()" was no longer a trivial file operation that just called down to the underlying file operations, but instead did at least two indirect calls. Indirect calls are sadly slow now with the Spectre mitigation, but the performance problem could at least be largely mitigated by changing the "->get_poll_head()" operation to just have a per-file-descriptor pointer to the poll head instead. That gets rid of one of the new indirections. But that doesn't fix the new complexity that is completely unwarranted for the regular case. The (undocumented) reason for the poll() changes was some alleged AIO poll race fixing, but we don't make the common case slower and more complex for some uncommon special case, so this all really needs way more explanations and most likely a fundamental redesign. [ This revert is a revert of about 30 different commits, not reverted individually because that would just be unnecessarily messy - Linus ] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2739b807 |
|
11-Jun-2018 |
Christoph Hellwig <hch@lst.de> |
aio: only return events requested in poll_mask() for IOCB_CMD_POLL The ->poll_mask() operation has a mask of events that the caller is interested in, but not all implementations might take it into account. Mask the return value to only the requested events, similar to what the poll and epoll code does. Reported-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9a6d9a62 |
|
04-Jun-2018 |
Adam Manzanares <adam.manzanares@wdc.com> |
fs: aio ioprio use ioprio_check_cap ret val Previously the value was ignored. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d9a08a9e |
|
22-May-2018 |
Adam Manzanares <adam.manzanares@wdc.com> |
fs: Add aio iopriority support This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
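
For illustration, how a submitter might use the new fields; IOPRIO_PRIO_VALUE and IOPRIO_CLASS_BE are the kernel's ioprio encoding helpers and may not be exported to userspace in all versions, so treat them as assumptions:

  struct iocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_lio_opcode = IOCB_CMD_PWRITE;
  cb.aio_fildes     = fd;
  cb.aio_buf        = (unsigned long)buf;
  cb.aio_nbytes     = len;
  cb.aio_flags      = IOCB_FLAG_IOPRIO;                      /* honour aio_reqprio */
  cb.aio_reqprio    = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4); /* best-effort, level 4 */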
|
#
fc28724d |
|
22-May-2018 |
Adam Manzanares <adam.manzanares@wdc.com> |
fs: Convert kiocb rw_hint from enum to u16 In order to avoid kiocb bloat for per command iopriority support, rw_hint is converted from enum to a u16. Added a guard around ki_hint assignment. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1da92779 |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: sanitize the limit checking in io_submit(2) As it is, the logic in native io_submit(2) is "if asked for more than LONG_MAX/sizeof(pointer) iocbs to submit, don't bother with more than LONG_MAX/sizeof(pointer)" (i.e. 512M requests on 32bit and 1E requests on 64bit) while compat io_submit(2) goes with "stop after the first PAGE_SIZE/sizeof(pointer) iocbs", i.e. 1K or so. Which is
  * inconsistent
  * *way* too much in the native case
  * possibly too little in the compat one
and
  * wrong anyway, since the natural point where we ought to stop bothering is ctx->nr_events
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
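
A sketch of the clamp this suggests, at the top of the submission path; exact variable names are assumptions:

  if (unlikely(nr < 0))
          return -EINVAL;
  if (nr > ctx->nr_events)        /* the ring can't hold more than this anyway */
          nr = ctx->nr_events;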
|
#
67ba049f |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: fold do_io_submit() into callers get rid of insane "copy array of 32bit pointers into an array of native ones" glue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
95af8496 |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: shift copyin of iocb into io_submit_one() Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d2988bd4 |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio_read_events_ring(): make a bit more readable The logics for 'avail' is
  * not past the tail of the cyclic buffer
  * no more than asked
  * not past the end of the buffer
  * not past the end of a page
Unobfuscate the last part. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9061d14a |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way ... so just make them return 0 when caller does not need to destroy iocb Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3c96c7f4 |
|
28-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: take list removal to (some) callers of aio_complete() We really want iocb out of io_cancel(2) reach before we start tearing it down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ac060cba |
|
27-May-2018 |
Christoph Hellwig <hch@lst.de> |
aio: add missing break for the IOCB_CMD_FDSYNC case Looks like this got lost in a merge. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1962da0d |
|
20-May-2018 |
Christoph Hellwig <hch@lst.de> |
aio: try to complete poll iocbs without context switch If we can acquire ctx_lock without spinning we can just remove our iocb from the active_reqs list, and thus complete the iocbs from the wakeup context. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
2c14fa83 |
|
20-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
aio: implement IOCB_CMD_POLL Simple one-shot poll through the io_submit() interface. To poll for a file descriptor the application should submit an iocb of type IOCB_CMD_POLL. It will poll the fd for the events specified in the first 32 bits of the aio_buf field of the iocb. Unlike poll or epoll without EPOLLONESHOT this interface always works in one-shot mode, that is, once the iocb is completed it will have to be resubmitted. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
888933f8 |
|
23-May-2018 |
Christoph Hellwig <hch@lst.de> |
aio: simplify cancellation With the current aio code there is no need for the magic KIOCB_CANCELLED value, as a cancellation just kicks the driver to queue the completion ASAP, with all actual completion handling done in another thread. Given that both the completion path and cancellation take the context lock there is no need for magic cmpxchg loops either. If we remove iocbs from the active list after calling ->ki_cancel (but with ctx_lock still held), we can also rely on the invariant that anything found on the list has a ->ki_cancel callback and can be cancelled, further simplifying the code. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
f3a2752a |
|
30-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
aio: simplify KIOCB_KEY handling No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY, as we can do that right after retrieving it from userspace. Also move the KIOCB_KEY definition to aio.c as it is an internal value not used by any other place in the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
4faa9996 |
|
23-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
fix io_destroy()/aio_complete() race If io_destroy() gets to cancelling everything that can be cancelled and gets to kiocb_cancel() calling the function the driver has left in ->ki_cancel, it becomes vulnerable to a race with IO completion. At that point req is already taken off the list and aio_complete() does *NOT* spin until we (in free_ioctx_users()) release ->ctx_lock. As a result, it proceeds to kiocb_free(), freeing req just as it gets passed to ->ki_cancel(). The fix is simple - remove from the list after the call to kiocb_cancel(). All instances of ->ki_cancel() already have to cope with being called with the iocb still on the list - that's what happens in io_cancel(2). Cc: stable@kernel.org Fixes: 0460fef2a921 "aio: use cancellation list lazily" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
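
The ordering described - call the driver's cancel callback first, unlink from the list only afterwards - looks roughly like this in the teardown loop; struct and helper names follow fs/aio.c of that era but should be read as a sketch:

  spin_lock_irq(&ctx->ctx_lock);
  while (!list_empty(&ctx->active_reqs)) {
          struct aio_kiocb *req;

          req = list_first_entry(&ctx->active_reqs, struct aio_kiocb, ki_list);
          kiocb_cancel(req);              /* may complete the iocb... */
          list_del_init(&req->ki_list);   /* ...so unlink only after the call */
  }
  spin_unlock_irq(&ctx->ctx_lock);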
|
#
baf10564 |
|
20-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: fix io_destroy(2) vs. lookup_ioctx() race kill_ioctx() used to have an explicit RCU delay between removing the reference from ->ioctx_table and percpu_ref_kill() dropping the refcount. At some point that delay had been removed, on the theory that percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was the wrong kind of RCU delay and it didn't care about rcu_read_lock() used by lookup_ioctx(). As the result, we could get ctx freed right under lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit RCU grace period when freeing kioctx"); however, that fix is not enough. Suppose io_destroy() from one thread races with e.g. io_setup() from another; CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2 has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the refcount, getting it to 0 and triggering a call of free_ioctx_users(), which proceeds to drop the secondary refcount and once that reaches zero calls free_ioctx_reqs(). That does INIT_RCU_WORK(&ctx->free_rwork, free_ioctx); queue_rcu_work(system_wq, &ctx->free_rwork); and schedules freeing the whole thing after RCU delay. In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the refcount from 0 to 1 and returned the reference to io_setup(). Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get freed until after percpu_ref_get(). Sure, we'd increment the counter before ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it has grabbed the reference, ctx is *NOT* going away until it gets around to dropping that reference. The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss. It's not costlier than what we currently do in normal case, it's safe to call since freeing *is* delayed and it closes the race window - either lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx() fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see the object in question at all. Cc: stable@kernel.org Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7a074e96 |
|
02-May-2018 |
Christoph Hellwig <hch@lst.de> |
aio: implement io_pgetevents This is the io_getevents equivalent of ppoll/pselect and allows properly mixing signals and aio completions (especially with IOCB_CMD_POLL); it atomically executes the following sequence:

  sigset_t origmask;

  pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
  ret = io_getevents(ctx, min_nr, nr, events, timeout);
  pthread_sigmask(SIG_SETMASK, &origmask, NULL);

Note that unlike many other signal related calls we do not pass a sigmask size, as that would get us to 7 arguments, which aren't easily supported by the syscall infrastructure. It seems a lot less painful to just add a new syscall variant in the unlikely case we're going to increase the sigset size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
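
A hedged userspace sketch of the new call via the raw syscall; the __aio_sigset layout shown (pointer plus size) is an assumption since, as the entry above notes, the type is deliberately not exported in the uapi, and ctx is assumed to come from an earlier io_setup():

  struct __aio_sigset {                   /* layout assumed; not in the uapi */
          const sigset_t *sigmask;
          size_t          sigsetsize;
  };

  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGUSR1);              /* mask installed only for the wait */

  struct __aio_sigset sig = { &mask, sizeof(mask) };
  struct io_event events[8];
  int ret = syscall(__NR_io_pgetevents, ctx, 1, 8, events, NULL, &sig);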
|
#
a3c0d439 |
|
27-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
aio: implement IOCB_CMD_FSYNC and IOCB_CMD_FDSYNC Simple workqueue offload for now, but prepared for adding a real aio_fsync method if the need arises. Based on an earlier patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
54843f87 |
|
02-May-2018 |
Christoph Hellwig <hch@lst.de> |
aio: refactor read/write iocb setup Don't reference the kiocb structure from the common aio code, and move any use of it into helper specific to the read/write path. This is in preparation for aio_poll support that wants to use the space for different fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
92ce4728 |
|
06-Apr-2018 |
Christoph Hellwig <hch@lst.de> |
aio: remove the extra get_file/fput pair in io_submit_one If we release the lockdep write protection token before calling into ->write_iter and thus never access the file pointer after an -EIOCBQUEUED return from ->write_iter or ->read_iter we don't need this extra reference. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
75321b50 |
|
09-Apr-2018 |
Christoph Hellwig <hch@lst.de> |
aio: sanitize ki_list handling Instead of handcoded non-null checks always initialize ki_list to an empty list and use list_empty / list_empty_careful on it. While we're at it also error out on a double call to kiocb_set_cancel_fn instead of ignoring it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
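
A sketch of the pattern: every iocb starts with an initialized empty list head, and a lockless emptiness check then gates the locked removal (names follow the commit's description, details assumed):

  INIT_LIST_HEAD(&req->ki_list);          /* at setup: empty, never NULL */

  /* at completion: only take ctx_lock if we are actually on the list */
  if (!list_empty_careful(&iocb->ki_list)) {
          unsigned long flags;

          spin_lock_irqsave(&ctx->ctx_lock, flags);
          list_del(&iocb->ki_list);
          spin_unlock_irqrestore(&ctx->ctx_lock, flags);
  }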
|
#
c213dc82 |
|
23-Nov-2017 |
Christoph Hellwig <hch@lst.de> |
aio: remove an outdated BUG_ON and comment in aio_complete These days we don't treat sync iocbs special in the aio completion code as they never use it. Remove the old comment and BUG_ON given that the current definition of is_sync_kiocb makes it impossible to hit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
01a658e1 |
|
14-Dec-2017 |
Christoph Hellwig <hch@lst.de> |
aio: don't print the page size at boot time The page size is in no way related to the aio code, and printing it in the (debug) dmesg at every boot serves no purpose. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
f729863a |
|
14-Mar-2018 |
Tejun Heo <tj@kernel.org> |
fs/aio: Use rcu_work instead of explicit rcu and work item Workqueue now has rcu_work. Use it instead of open-coding rcu -> work item bouncing. Signed-off-by: Tejun Heo <tj@kernel.org>
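
The rcu_work pattern being referred to, as a sketch; the free_rwork field name matches what the baf10564 entry above quotes, the callback body is elided:

  /* in struct kioctx: */
  struct rcu_work free_rwork;             /* executes free_ioctx() */

  /* instead of open-coded call_rcu() -> schedule_work() bouncing: */
  INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
  queue_rcu_work(system_wq, &ctx->free_rwork);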
|
#
d0264c01 |
|
14-Mar-2018 |
Tejun Heo <tj@kernel.org> |
fs/aio: Use RCU accessors for kioctx_table->table[] While converting ioctx index from a list to a table, db446a08c23d ("aio: convert the ioctx list to table lookup v3") missed tagging kioctx_table->table[] as an array of RCU pointers and using the appropriate RCU accessors. This introduces a small window in the lookup path where init and access may race. Mark kioctx_table->table[] with __rcu and use the appropriate RCU accessors when using the field. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3") Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.12+
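
In sketch form, the tagging and accessors the commit talks about; the kioctx_table layout is assumed from context:

  struct kioctx_table {
          struct rcu_head      rcu;
          unsigned             nr;
          struct kioctx __rcu *table[];   /* now tagged __rcu */
  };

  /* lookup side */
  rcu_read_lock();
  table = rcu_dereference(mm->ioctx_table);
  ctx = rcu_dereference(table->table[id]);
  rcu_read_unlock();

  /* install side */
  rcu_assign_pointer(table->table[i], ctx);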
|
#
a6d7cff4 |
|
14-Mar-2018 |
Tejun Heo <tj@kernel.org> |
fs/aio: Add explicit RCU grace period when freeing kioctx While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat") incorrectly removed explicit RCU grace period before freeing kioctx. The intention seems to be depending on the internal RCU grace periods of percpu_ref; however, percpu_ref uses a different flavor of RCU, sched-RCU. This can lead to kioctx being freed while RCU read protected dereferences are still in progress. Fix it by updating free_ioctx() to go through call_rcu() explicitly. v2: Comment added to explain double bouncing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: e34ecee2ae79 ("aio: Fix a trinity splat") Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.13+
|
#
6aa7de05 |
|
23-Oct-2017 |
Mark Rutland <mark.rutland@arm.com> |
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
fa2e62a5 |
|
04-Aug-2017 |
Deepa Dinamani <deepa.kernel@gmail.com> |
io_getevents: Use timespec64 to represent timeouts struct timespec is not y2038 safe. Use y2038 safe struct timespec64 to represent timeouts. The system call interface itself will be changed as part of different series. Timeouts will not really need more than 32 bits. But, replacing these with timespec64 helps verification of a y2038 safe kernel by getting rid of timespec internally. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: linux-aio@kvack.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2916ecc0 |
|
08-Sep-2017 |
Jérôme Glisse <jglisse@redhat.com> |
mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Introduce a new migration mode that allow to offload the copy to a device DMA engine. This changes the workflow of migration and not all address_space migratepage callback can support this. This is intended to be use by migrate_vma() which itself is use for thing like HMM (see include/linux/hmm.h). No additional per-filesystem migratepage testing is needed. I disables MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i added comment in those to explain why (part of this patch). The commit message is unclear it should say that any callback that wish to support this new mode need to be aware of the difference in the migration flow from other mode. Some of these callbacks do extra locking while copying (aio, zsmalloc, balloon, ...) and for DMA to be effective you want to copy multiple pages in one DMA operations. But in the problematic case you can not easily hold the extra lock accross multiple call to this callback. Usual flow is: For each page { 1 - lock page 2 - call migratepage() callback 3 - (extra locking in some migratepage() callback) 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) 5 - copy page 6 - (unlock any extra lock of migratepage() callback) 7 - return from migratepage() callback 8 - unlock page } The new mode MIGRATE_SYNC_NO_COPY: 1 - lock multiple pages For each page { 2 - call migratepage() callback 3 - abort in all problematic migratepage() callback 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) } // finished all calls to migratepage() callback 5 - DMA copy multiple pages 6 - unlock all the pages To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a new callback migratepages() (for instance) that deals with multiple pages in one transaction. Because the problematic cases are not important for current usage I did not wanted to complexify this patchset even more for no good reason. Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2a8a9867 |
|
05-Jul-2017 |
Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> |
fs: aio: fix the increment of aio-nr and counting against aio-max-nr Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8' for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4': ioctx_alloc() ... nr_events = max(nr_events, num_possible_cpus() * 4); nr_events *= 2; ... ctx->max_reqs = nr_events; ... aio_nr += ctx->max_reqs; .... This limits the number of aio contexts actually available to much less than aio-max-nr, and is increasingly worse with greater number of CPUs. For example, with 64 CPUs, only 256 aio contexts are actually available (with aio-max-nr = 65536) because the increment is 512 in that scenario. Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context] is 128, but make it 256 (double) as counting against 'aio-max-nr * 2': ioctx_alloc() ... if (aio_nr + nr_events > (aio_max_nr * 2UL) || ... goto err_ctx; ... This patch uses the original value of nr_events (from userspace) to increment aio-nr and count against aio-max-nr, which resolves those. Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com> Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com> Tested-by: Paul Nguyen <nguyenp@us.ibm.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
91f9943e |
|
29-Aug-2017 |
Christoph Hellwig <hch@lst.de> |
fs: support RWF_NOWAIT for buffered reads This is based on the old idea and code from Milosz Tanski. With the aio nowait code it becomes mostly trivial now. Buffered writes continue to return -EOPNOTSUPP if RWF_NOWAIT is passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
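
From userspace the feature is reachable through preadv2(2); a small example, with the fallback helper being purely hypothetical:

  struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

  ssize_t n = preadv2(fd, &iov, 1, -1, RWF_NOWAIT);   /* -1: use the file offset */
  if (n < 0 && errno == EAGAIN)
          /* data not in the page cache - reading now would have blocked */
          queue_for_blocking_read(fd);                 /* hypothetical fallback */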
|
#
45d06cf7 |
|
27-Jun-2017 |
Jens Axboe <axboe@kernel.dk> |
fs: add O_DIRECT and aio support for sending down write life time hints Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
b745fafa |
|
20-Jun-2017 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT RWF_NOWAIT informs the kernel to bail out if an AIO request would block, for reasons such as file allocations, a writeback being triggered, or having to block while allocating requests when performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. FMODE_AIO_NOWAIT is a flag which identifies that the opened file is capable of returning -EAGAIN if the AIO call will block. It must be set by supporting filesystems in the ->open() call. The xfs, btrfs and ext4 filesystems are supported in the following patches. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
9830f4be |
|
20-Jun-2017 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
fs: Use RWF_* flags for AIO operations aio_rw_flags is introduced in struct iocb (using aio_reserved1) and will carry the RWF_* flags. We cannot use aio_flags because they are not checked for validity, which may break existing applications. Note, the only place RWF_HIPRI comes into effect is dio_await_one(). In all the other locations, the aio code returns -EIOCBQUEUED before the checks for RWF_HIPRI. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
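
Correspondingly, an aio submitter can put per-I/O RWF_* flags into the new field; a sketch:

  struct iocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_lio_opcode = IOCB_CMD_PREAD;
  cb.aio_fildes     = fd;
  cb.aio_buf        = (unsigned long)buf;
  cb.aio_nbytes     = len;
  cb.aio_rw_flags   = RWF_NOWAIT;         /* per-I/O flag, unlike aio_flags */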
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
897ab3e0 |
|
24-Feb-2017 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
userfaultfd: non-cooperative: add event for memory unmaps When a non-cooperative userfaultfd monitor copies pages in the background, it may encounter regions that were already unmapped. Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely changes in the virtual memory layout. Since there might be different uffd contexts for the affected VMAs, we first should create a temporary representation for the unmap event for each uffd context and then notify them one by one to the appropriate userfault file descriptors. The event notification occurs after the mmap_sem has been released. [arnd@arndb.de: fix nommu build] Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de [mhocko@suse.com: fix nommu build] Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bb7462b6 |
|
20-Feb-2017 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: use helpers for calling f_op->{read,write}_iter() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
a12f1ae6 |
|
13-Dec-2016 |
Shaohua Li <shli@fb.com> |
aio: fix lock dep warning lockdep reports a warnning. file_start_write/file_end_write only acquire/release the lock for regular files. So checking the files in aio side too. [ 453.532141] ------------[ cut here ]------------ [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670 [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 453.533011] Modules linked in: [ 453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964 [ 453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 453.533011] ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000 [ 453.533011] ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008 [ 453.533011] ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef [ 453.533011] Call Trace: [ 453.533011] [<ffffffff8196cffb>] dump_stack+0x67/0x9c [ 453.533011] [<ffffffff81091ee1>] __warn+0x111/0x130 [ 453.533011] [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0 [ 453.533011] [<ffffffff81091f00>] ? __warn+0x130/0x130 [ 453.533011] [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60 [ 453.533011] [<ffffffff811205d4>] lock_release+0x434/0x670 [ 453.533011] [<ffffffff8198af94>] ? import_single_range+0xd4/0x110 [ 453.533011] [<ffffffff81322195>] ? rw_verify_area+0x65/0x140 [ 453.533011] [<ffffffff813aa696>] ? aio_write+0x1f6/0x280 [ 453.533011] [<ffffffff813aa6c9>] aio_write+0x229/0x280 [ 453.533011] [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640 [ 453.533011] [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 453.533011] [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30 [ 453.533011] [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40 [ 453.533011] [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0 [ 453.533011] [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10 [ 453.533011] [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10 [ 453.533011] [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270 [ 453.533011] [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0 [ 453.533011] [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 453.533011] [<ffffffff813acb90>] SyS_io_submit+0x10/0x20 [ 453.533011] [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad [ 453.533011] [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110 [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]--- Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
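
The shape of the fix being described - only take and release freeze protection for regular files, which is all file_start_write()/file_end_write() actually cover - as a sketch rather than the literal patch:

  /* submission side */
  if (S_ISREG(file_inode(file)->i_mode))
          file_start_write(file);
  ret = file->f_op->write_iter(req, &iter);

  /* completion side, once the write has actually finished */
  if (S_ISREG(file_inode(file)->i_mode))
          file_end_write(file);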
|
#
2456e855 |
|
25-Dec-2016 |
Thomas Gleixner <tglx@linutronix.de> |
ktime: Get rid of the union ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec variant for 32-bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but became completely pointless. Get rid of the union and just keep ktime_t as a simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
|
#
7c0f6ba6 |
|
24-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c00d2c7e |
|
20-Dec-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
move aio compat to fs/aio.c ... and fix the minor buglet in compat io_submit() - native one kills ioctx as cleanup when put_user() fails. Get rid of bogus compat_... in !CONFIG_AIO case, while we are at it - they should simply fail with ENOSYS, same as for native counterparts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
45063097 |
|
04-Dec-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
don't open-code file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
70fe2f48 |
|
30-Oct-2016 |
Jan Kara <jack@suse.cz> |
aio: fix freeze protection of aio writes Currently we dropped freeze protection of aio writes just after IO was submitted. Thus aio write could be in flight while the filesystem was frozen and that could result in unexpected situation like aio completion wanting to convert extent type on frozen filesystem. Testcase from Dmitry triggering this is like: for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done & fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \ --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite Fix the problem by dropping freeze protection only once IO is completed in aio_complete(). Reported-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> [hch: forward ported on top of various VFS and aio changes] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
89319d31 |
|
30-Oct-2016 |
Christoph Hellwig <hch@lst.de> |
fs: remove aio_run_iocb Pass the ABI iocb structure to aio_setup_rw and let it handle the non-vectored I/O case as well. With that and a new helper for the AIO return value handling we can now define new aio_read and aio_write helpers that implement reads and writes in a self-contained way without duplicating too much code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
723c0384 |
|
30-Oct-2016 |
Christoph Hellwig <hch@lst.de> |
fs: remove the never implemented aio_fsync file operation Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0b944d3a |
|
30-Oct-2016 |
Christoph Hellwig <hch@lst.de> |
aio: hold an extra file reference over AIO read/write operations Otherwise we might dereference an already freed file and/or inode when aio_complete is called before we return from the read_iter or write_iter method. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
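A simplified sketch of the idea (not the exact patch; iter is assumed to be the already-prepared iov_iter): hold a private file reference across the ->read_iter()/->write_iter() call so a completion that fires before the method returns cannot drop the last reference to the file and inode.

	struct file *file = iocb->ki_filp;
	ssize_t ret;

	get_file(file);				/* extra reference across the call */
	ret = file->f_op->read_iter(iocb, &iter);
	fput(file);				/* safe: dropped only after ->read_iter() returns */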
|
#
de04e769 |
|
14-Sep-2016 |
Rasmus Villemoes <linux@rasmusvillemoes.dk> |
fs/aio.c: eliminate redundant loads in put_aio_ring_file Using a local variable we can prevent gcc from reloading aio_ring_file->f_inode->i_mapping twice, eliminating 2x2 dependent loads. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
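A sketch of the pattern being described (field names as used in fs/aio.c of that era, quoted from memory): cache the mapping in a local so the two dependent pointer loads happen once instead of being re-derived for every access.

	struct file *aio_ring_file = ctx->aio_ring_file;

	if (aio_ring_file) {
		struct address_space *mapping = aio_ring_file->f_inode->i_mapping;

		spin_lock(&mapping->private_lock);
		mapping->private_data = NULL;
		ctx->aio_ring_file = NULL;
		spin_unlock(&mapping->private_lock);

		fput(aio_ring_file);
	}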
|
#
22f6b4d3 |
|
15-Sep-2016 |
Jann Horn <jann@thejh.net> |
aio: mark AIO pseudo-fs noexec This ensures that do_mmap() won't implicitly make AIO memory mappings executable if the READ_IMPLIES_EXEC personality flag is set. Such behavior is problematic because the security_mmap_file LSM hook doesn't catch this case, potentially permitting an attacker to bypass a W^X policy enforced by SELinux. I have tested the patch on my machine. To test the behavior, compile and run this:

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/personality.h>
	#include <linux/aio_abi.h>
	#include <err.h>
	#include <stdlib.h>
	#include <stdio.h>
	#include <sys/syscall.h>

	int main(void) {
		personality(READ_IMPLIES_EXEC);
		aio_context_t ctx = 0;
		if (syscall(__NR_io_setup, 1, &ctx))
			err(1, "io_setup");
		char cmd[1000];
		sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
			(int)getpid());
		system(cmd);
		return 0;
	}

In the output, "rw-s" is good, "rwxs" is bad. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
013373e8 |
|
23-May-2016 |
Michal Hocko <mhocko@suse.com> |
aio: make aio_setup_ring killable aio_setup_ring waits for mmap_sem in writable mode. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in killable mode and return with EINTR if the task got killed while waiting. This will also expedite the return to userspace and do_exit. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
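Roughly, the change amounts to the following sketch (assuming the mmap_sem-era locking API; the cleanup call is illustrative):

	/* take mmap_sem killably so an OOM-killed task can bail out with -EINTR */
	if (down_write_killable(&current->mm->mmap_sem)) {
		aio_free_ring(ctx);
		return -EINTR;
	}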
|
#
2958ec17 |
|
31-Mar-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: remove a pointless assignment the value is never used after that point Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5477e70a |
|
04-Sep-2015 |
Oleg Nesterov <oleg@redhat.com> |
mm: move ->mremap() from file_operations to vm_operations_struct vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this way ->mremap() can have more users. Say, vdso. While at it, s/aio_ring_remap/aio_ring_mremap/. Note: this is the minimal change before ->mremap() finds another user in file_operations; this method should have more arguments, and it can be used to kill arch_remap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dc48e56d |
|
15-Apr-2015 |
Jens Axboe <axboe@fb.com> |
aio: fix serial draining in exit_aio() exit_aio() currently serializes killing io contexts. Each context killing ends up having to do percpu_ref_kill(), which in turn has to wait for an RCU grace period. This can take a long time, depending on the number of contexts. And there's no point in doing them serially, when we could be waiting for all of them in one fell swoop. This patch makes my fio thread offload test case exit in 0.2s instead of almost 6s. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
2ba48ce5 |
|
09-Apr-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
mirror O_APPEND and O_DIRECT into iocb->ki_flags ... avoiding write_iter/fcntl races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
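Conceptually (a sketch using the IOCB_* flag names from <linux/fs.h>), the submission path snapshots the relevant f_flags into the iocb once, so a later fcntl(F_SETFL) cannot change the semantics of an already-submitted request:

	/* snapshot file flags into the iocb at submission time */
	iocb->ki_flags = 0;
	if (file->f_flags & O_APPEND)
		iocb->ki_flags |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		iocb->ki_flags |= IOCB_DIRECT;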
|
#
84363182 |
|
03-Apr-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
->aio_read and ->aio_write removed no remaining users Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
47e39362 |
|
31-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
aio_run_iocb(): kill dead check We check if ->ki_pos is positive. However, by that point we have already done rw_verify_area(), which would have rejected such unless the file had been one of /dev/mem, /dev/kmem and /proc/kcore. All of which do not have vectored rw methods, so we would've bailed out even earlier. This check had been introduced before rw_verify_area() had been added there - in fact, it was a subset of checks done on sync paths by rw_verify_area() (back then the /dev/mem exception didn't exist at all). The rest of checks (mandatory locking, etc.) hadn't been added until later. Unfortunately, by the time the call of rw_verify_area() got added, the /dev/mem exception had already appeared, so it wasn't obvious that the older explicit check downstream had become dead code. It *is* a dead code, though, since the few files for which the exception applies do not have ->aio_{read,write}() or ->{read,write}_iter() and for them we won't reach that check anyway. What's more, even if we ever introduce vectored methods for /dev/mem and friends, they'll have to cope with negative positions anyway, since readv(2) and writev(2) are using the same checks as read(2) and write(2) - i.e. rw_verify_area(). Let's bury it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
08397acd |
|
31-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
ioctx_alloc(): remove pointless check Way, way back kiocb used to be picked from arrays, so ioctx_alloc() checked for multiplication overflow when calculating the size of such array. By the time fs/aio.c went into the tree (in 2002) they were already allocated one-by-one by kmem_cache_alloc(), so that check had already become pointless. Let's bury it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
32a56afa |
|
21-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
aio_setup_vectored_rw(): switch to {compat_,}import_iovec() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
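For reference, a hedged sketch of what a caller of the helper looks like (buf and nr_segs stand in for the user-supplied vector and count; signature as of that era):

	struct iovec inline_vecs[UIO_FASTIOV], *iov = inline_vecs;
	struct iov_iter iter;
	ssize_t ret;

	/* copies and validates the user iovec, using the inline array when it fits */
	ret = import_iovec(READ, buf, nr_segs, UIO_FASTIOV, &iov, &iter);
	if (ret < 0)
		return ret;
	/* ... do the I/O through &iter ... */
	kfree(iov);	/* NULL (a no-op) unless import_iovec() had to allocate */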
|
#
d4fb392f |
|
21-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
kill aio_setup_single_vector() identical to import_single_range() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a96114fa |
|
20-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: simplify arguments of aio_setup_..._rw() We don't need req in either of those. We don't need nr_segs in caller. We don't really need len in caller either - iov_iter_count(&iter) will do. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4c185ce0 |
|
20-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: lift iov_iter_init() into aio_setup_..._rw() the only non-trivial detail is that we do it before rw_verify_area(), so we'd better cap the length ourselves in aio_setup_single_rw() case (for vectored case rw_copy_check_uvector() will do that for us). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
deeb8525 |
|
06-Apr-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
ioctx_alloc(): fix vma (and file) leak on failure If we fail past the aio_setup_ring(), we need to destroy the mapping. We don't need to care about anybody having found ctx, or added requests to it, since the last failure exit is exactly the failure to make ctx visible to lookups. Reproducer (based on one by Joe Mario <jmario@redhat.com>):

	void count(char *p)
	{
		char s[80];
		printf("%s: ", p);
		fflush(stdout);
		sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'",
			getpid());
		system(s);
	}

	int main()
	{
		io_context_t *ctx;
		int created, limit, i, destroyed;
		FILE *f;

		count("before");
		if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
			perror("opening aio-max-nr");
		else if (fscanf(f, "%d", &limit) != 1)
			fprintf(stderr, "can't parse aio-max-nr\n");
		else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
			perror("allocating aio_context_t array");
		else {
			for (i = 0, created = 0; i < limit; i++) {
				if (io_setup(1000, ctx + created) == 0)
					created++;
			}
			for (i = 0, destroyed = 0; i < created; i++)
				if (io_destroy(ctx[i]) == 0)
					destroyed++;
			printf("created %d, failed %d, destroyed %d\n",
				created, limit - created, destroyed);
			count("after");
		}
	}

Found-by: Joe Mario <jmario@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b2edffdd |
|
06-Apr-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
fix mremap() vs. ioctx_kill() race teach ->mremap() method to return an error and have it fail for aio mappings in process of being killed Note that in case of ->mremap() failure we need to undo move_page_tables() we'd already done; we could call ->mremap() first, but then the failure of move_page_tables() would require undoing whatever _successful_ ->mremap() has done, which would be a lot more headache in general. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
04b2fa9f |
|
02-Feb-2015 |
Christoph Hellwig <hch@lst.de> |
fs: split generic and aio kiocb Most callers in the kernel want to perform synchronous file I/O, but still have to bloat the stack with a full struct kiocb. Split out the parts needed in filesystem code from those in the aio code, and only allocate those needed to pass down arguments on the stack. The aio code embeds the generic iocb in the one it allocates and can easily get back to it by using container_of. Also add a ->ki_complete method to struct kiocb, this is used to call into the aio code and thus removes the dependency on aio for filesystems implementing asynchronous operations. It will also allow other callers to substitute their own completion callback. We also add a new ->ki_flags field to work around the nasty layering violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add eventfd notification about ffs events"). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
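Schematically (a sketch of the layout the message describes; the completion-callback name is invented for the example), the aio-only state wraps the generic kiocb and is recovered with container_of() from the ->ki_complete() callback:

	struct aio_kiocb {
		struct kiocb		common;		/* what filesystems see */
		struct kioctx		*ki_ctx;	/* aio-only state lives out here */
		struct list_head	ki_list;
		/* ... */
	};

	static void aio_rw_complete(struct kiocb *kiocb, long res, long res2)
	{
		/* get back from the generic kiocb to the aio container */
		struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, common);

		/* ... deliver the completion event on iocb->ki_ctx ... */
	}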
|
#
599bd19b |
|
11-Feb-2015 |
Christoph Hellwig <hch@lst.de> |
fs: don't allow to complete sync iocbs through aio_complete The AIO interface is fairly complex because it tries to allow filesystems to always work async and then wakeup a synchronous caller through aio_complete. It turns out that basically no one was doing this to avoid the complexity and context switches, and we've already fixed up the remaining users and can now get rid of this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
66ee59af |
|
11-Feb-2015 |
Christoph Hellwig <hch@lst.de> |
fs: remove ki_nbytes There is no need to pass the total request length in the kiocb, as we already get passed in through the iov_iter argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
acd88d4e |
|
04-Feb-2015 |
Kinglong Mee <kinglongmee@gmail.com> |
fs/aio.c: Remove duplicate function name in pr_debug messages Have defined pr_fmt as below in fs/aio.c, so remove duplicate function name in pr_debug message. #define pr_fmt(fmt) "%s: " fmt, __func__ Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9c9ce763 |
|
03-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
aio: annotate aio_read_event_ring for sleep patterns Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw warnings like the following due to being called from wait_event context: WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110 Modules linked in: CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8 ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8 ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300 Call Trace: [<ffffffff81daf2bd>] dump_stack+0x4c/0x65 [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0 [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90 [<ffffffff81db8344>] mutex_lock+0x24/0x45 [<ffffffff81216b7c>] aio_read_events+0x4c/0x290 [<ffffffff81216fac>] read_events+0x1ec/0x220 [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110 [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50 [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0 [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17 ---[ end trace bde69eaf655a4fea ]--- There is not actually a bug here, so annotate the code to tell the debug logic that everything is just fine and not to fire a false positive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
b83ae6d4 |
|
14-Jan-2015 |
Christoph Hellwig <hch@lst.de> |
fs: remove mapping->backing_dev_info Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
b4caecd4 |
|
14-Jan-2015 |
Christoph Hellwig <hch@lst.de> |
fs: introduce f_op->mmap_capabilities for nommu mmap support Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to its original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows removing lots of backing_dev_info instances that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
5f785de5 |
|
06-Nov-2014 |
Fam Zheng <famz@redhat.com> |
aio: Skip timer for io_getevents if timeout=0 In this case, it is basically polling. Let's not involve a timer at all because that would hurt performance for application event loops. In an arbitrary test I've done, io_getevents syscall elapsed time reduces from 50000+ nanoseconds to a few hundred. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
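In outline (a sketch borrowing names loosely from fs/aio.c, not the literal diff), the zero-timeout case becomes a single non-blocking scan of the ring instead of an hrtimer-backed wait:

	/* timeout == 0 means "just poll", so never arm a timer */
	if (!until)
		aio_read_events(ctx, min_nr, nr, event, &ret);
	else
		wait_event_interruptible_hrtimeout(ctx->wait,
				aio_read_events(ctx, min_nr, nr, event, &ret),
				until);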
|
#
e4a0d3e7 |
|
18-Sep-2014 |
Pavel Emelyanov <xemul@parallels.com> |
aio: Make it possible to remap aio ring There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
835f252c |
|
06-Nov-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer https://bugzilla.kernel.org/show_bug.cgi?id=86831 Markus reported that when shutting down mysqld (with AIO support, on a ext3 formatted Harddrive) leads to a negative number of dirty pages (underrun to the counter). The negative number results in a drastic reduction of the write performance because the page cache is not used, because the kernel thinks it is still 2 ^ 32 dirty pages open. Add a warn trace in __dec_zone_state will catch this easily: static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item) { atomic_long_dec(&zone->vm_stat[item]); + WARN_ON_ONCE(item == NR_FILE_DIRTY && atomic_long_read(&zone->vm_stat[item]) < 0); atomic_long_dec(&vm_stat[item]); } [ 21.341632] ------------[ cut here ]------------ [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224() [ 21.355296] Modules linked in: wutbox_cp sata_mv [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80 [ 21.366793] Workqueue: events free_ioctx [ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24) [ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28) [ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c) [ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34) [ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224) [ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158) [ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c) [ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac) [ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74) [ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90) [ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc) [ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44) [ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c) [ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414) [ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec) [ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20) [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]--- The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty (bypasses the VFS dirty pages increment) when init, and aio fs uses *default_backing_dev_info* as the backing dev, which does not disable the dirty pages accounting capability. So truncating aio ring file will contribute to accounting dirty pages (VFS dirty pages decrement), then error occurs. The original goal is keeping these pages in memory (can not be reclaimed or swapped) in life-time via marking it dirty. But thinking more, we have already pinned pages via elevating the page's refcount, which can already achieve the goal, so the SetPageDirty seems unnecessary. In order to fix the issue, using the __set_page_dirty_no_writeback instead of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually set the dirty flags, don't disable set_page_dirty(), rely on default behaviour). With the above change, the dirty pages accounting can work well. 
But as we know, the aio fs is an anonymous one which should never cause any real write-back, so we can ignore the dirty page (write-back) accounting by disabling that accounting capability. To do so we introduce an aio-private backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the default one. Reported-by: Markus Königshaus <m.koenigshaus@wut.de> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable <stable@vger.kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
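A sketch of what that combination looks like (illustrative; __set_page_dirty_no_writeback is the generic mm helper mentioned above):

	/* keep dirty-page bookkeeping consistent without manual SetPageDirty() */
	static const struct address_space_operations aio_ctx_aops = {
		.set_page_dirty	= __set_page_dirty_no_writeback,
	#if IS_ENABLED(CONFIG_MIGRATION)
		.migratepage	= aio_migratepage,
	#endif
	};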
|
#
2aad2a86 |
|
24-Sep-2014 |
Tejun Heo <tj@kernel.org> |
percpu_ref: add PERCPU_REF_INIT_* flags With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
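A usage sketch combining this entry and the following @gfp change (the ref and release-callback names are borrowed from fs/aio.c for flavor):

	/* start the ref already killed and drained; bring it up later with reinit */
	ret = percpu_ref_init(&ctx->reqs, free_ioctx_reqs,
			      PERCPU_REF_INIT_DEAD, GFP_KERNEL);
	if (ret)
		goto err;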
|
#
a34375ef |
|
07-Sep-2014 |
Tejun Heo <tj@kernel.org> |
percpu-refcount: add @gfp to percpu_ref_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>
|
#
6098b45b |
|
03-Sep-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: block exit_aio() until all context requests are completed It seems that exit_aio() also needs to wait for all iocbs to complete (like io_destroy), but we missed the wait step in the current implementation, so fix it in the same way as we did in io_destroy. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
|
#
2ff396be |
|
02-Sep-2014 |
Jeff Moyer <jmoyer@redhat.com> |
aio: add missing smp_rmb() in read_events_ring We ran into a case on ppc64 running mariadb where io_getevents would return zeroed out I/O events. After adding instrumentation, it became clear that there was some missing synchronization between reading the tail pointer and the events themselves. This small patch fixes the problem in testing. Thanks to Zach for helping to look into this, and suggesting the fix. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
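The ordering requirement, in sketch form (reader side only, ignoring the page-at-a-time kmap the real code does; the writer in aio_complete() publishes the events before moving ring->tail):

	unsigned head = ring->head, tail = ring->tail;

	smp_rmb();	/* read the tail index before reading the events it covers */

	while (head != tail) {
		struct io_event ev = ring->io_events[head % nr_events];
		/* ... copy ev out to the user's buffer ... */
		head = (head + 1) % nr_events;
	}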
|
#
d856f32a |
|
24-Aug-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix reqs_available handling As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request leak when events are reaped by userspace") introduces a regression when user code attempts to perform io_submit() with more events than are available in the ring buffer. Reverting that commit would reintroduce a regression when user space event reaping is used. Fixing this bug is a bit more involved than the previous attempts to fix this regression. Since we do not have a single point at which we can count events as being reaped by user space and io_getevents(), we have to track event completion by looking at the number of events left in the event ring. So long as there are as many events in the ring buffer as there have been completion events generated, we cannot call put_reqs_available(). The code to check for this is now placed in refill_reqs_available(). A test program from Dan, modified by me for verifying this bug, is available at http://www.kvack.org/~bcrl/20140824-aio_bug.c . Reported-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Dan Aloni <dan@kernelim.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845ac was backported to Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
00fefb9c |
|
23-Jul-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: use iovec array rather than the single one Previously, we only offered a single iovec to handle all the read/write cases, so the PREADV/PWRITEV request always needs to allocate an extra iovec buffer when copying user vectors. If we use a tmp iovec array rather than the single one, some small PREADV/PWRITEV workloads (vector sizes smaller than the tmp buffer) will not need to allocate an extra iovec buffer when copying user vectors. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
2be4e7de |
|
23-Jul-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: fix some comments The function comments of aio_run_iocb and aio_read_events are out of date, so fix them here. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
8dc4379e |
|
23-Jul-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: use the macro rather than the inline magic number Replace the inline magic number with the ready-made macro(AIO_RING_MAGIC), just clean up. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
b53f1f82 |
|
23-Jul-2014 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: remove the needless registration of ring file's private_data Remove the registration of ring file's private_data, we do not use it. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
be6fb451 |
|
22-Jul-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: remove no longer needed preempt_disable() Based on feedback from Jens Axboe on 263782c1c95bbddbb022dc092fd89a36bb8d5577, clean up get/put_reqs_available() to remove the no longer needed preempt_disable() and preempt_enable() pair. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jens Axboe <axboe@kernel.dk>
|
#
263782c1 |
|
13-Jul-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: protect reqs_available updates from changes in interrupt handlers As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to have put_reqs_available() called from irq context. While put_reqs_available() is per cpu, it did not protect itself from interrupts on the same CPU. This led to aio_complete() corrupting the available io requests count when run under heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by disabling irq updates around the per cpu batch updates of reqs_available. Many thanks to Robert and folks for testing and tracking this down. Reported-by: Robert Elliot <Elliott@hp.com> Tested-by: Robert Elliot <Elliott@hp.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org
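A sketch of the described fix (field names as in fs/aio.c of that era, quoted from memory): the per-cpu batch update runs with interrupts blocked, because aio_complete() may now execute the same code from irq context on the same CPU.

	static void put_reqs_available(struct kioctx *ctx, unsigned nr)
	{
		struct kioctx_cpu *kcpu;
		unsigned long flags;

		local_irq_save(flags);
		kcpu = this_cpu_ptr(ctx->cpu);
		kcpu->reqs_available += nr;

		while (kcpu->reqs_available >= ctx->req_batch * 2) {
			kcpu->reqs_available -= ctx->req_batch;
			atomic_add(ctx->req_batch, &ctx->reqs_available);
		}
		local_irq_restore(flags);
	}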
|
#
9a1049da |
|
28-Jun-2014 |
Tejun Heo <tj@kernel.org> |
percpu-refcount: require percpu_ref to be exited explicitly Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles. * It's impossible to re-init a released reference counter without going through re-allocation. * In the similar vein, it's impossible to initialize a percpu_ref count with static percpu variables. * We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init(). This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() is considered but it gets confusing with percpu_ref_kill() while "exit" clearly indicates that it's the counterpart of percpu_ref_init(). All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Li Zefan <lizefan@huawei.com>
|
#
55c6c814 |
|
28-Jun-2014 |
Tejun Heo <tj@kernel.org> |
percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() ioctx_alloc() reaches inside percpu_ref and directly frees ->pcpu_count in its failure path, which is quite gross. percpu_ref has been providing a proper interface to do this, percpu_ref_cancel_init(), for quite some time now. Let's use that instead. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com>
|
#
855ef0de |
|
30-Apr-2014 |
Oleg Nesterov <oleg@redhat.com> |
aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() ioctx_add_table() is the writer, it does not need rcu_read_lock() to protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just add the confusion. And it doesn't need rcu_dereference() by the same reason, it must see any updates previously done under the same ->ioctx_lock. We could use rcu_dereference_protected() but the patch uses rcu_dereference_raw(), the function is simple enough. The same for kill_ioctx(), although it does not update the pointer. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
4b70ac5f |
|
30-Apr-2014 |
Oleg Nesterov <oleg@redhat.com> |
aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() On 04/30, Benjamin LaHaise wrote: > > > - ctx->mmap_size = 0; > > - > > - kill_ioctx(mm, ctx, NULL); > > + if (ctx) { > > + ctx->mmap_size = 0; > > + kill_ioctx(mm, ctx, NULL); > > + } > > Rather than indenting and moving the two lines changing mmap_size and the > kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces > the number of lines changed and avoid excessive indentation. OK. To me the code looks better/simpler with "if (ctx)", but this is subjective of course, I won't argue. The patch still removes the empty line between mmap_size = 0 and kill_ioctx(), we reset mmap_size only for kill_ioctx(). But feel free to remove this change. ------------------------------------------------------------------------------- Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() 1. We can read ->ioctx_table only once and we do not read rcu_read_lock() or even rcu_dereference(). This mm has no users, nobody else can play with ->ioctx_table. Otherwise the code is buggy anyway, if we need rcu_read_lock() in a loop because ->ioctx_table can be updated then kfree(table) is obviously wrong. 2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid munmap(), but another reason is that we simply can't do vm_munmap() unless current->mm == mm and this is not true in general, the caller is mmput(). 3. We do not really need to nullify mm->ioctx_table before return, probably the current code does this to catch the potential problems. But in this case RCU_INIT_POINTER(NULL) looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
edfbbf38 |
|
24-Jun-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10 by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to aio_read_events_ring() failed to correctly limit the index into ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of an arbitrary page, with a copy_to_user() then copying its contents into userspace. This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and Petr for disclosing this issue. This patch applies to v3.12+. A separate backport is needed for 3.10/3.11. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: stable@vger.kernel.org
|
#
f8567a38 |
|
24-Jun-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix aio request leak when events are reaped by userspace The aio cleanups and optimizations by kmo that were merged into the 3.10 tree added a regression for userspace event reaping. Specifically, the reference counts are not decremented if the event is reaped in userspace, leading to the application being unable to submit further aio requests. This patch applies to 3.12+. A separate backport is required for 3.10/3.11. This issue was uncovered as part of CVE-2014-0206. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com>
|
#
293bc982 |
|
11-Feb-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
new methods: ->read_iter() and ->write_iter() Beginning to introduce those. Just the callers for now, and it's clumsier than it'll eventually become; once we finish converting aio_read and aio_write instances, the things will get nicer. For now, these guys are in parallel to ->aio_read() and ->aio_write(); they take iocb and iov_iter, with everything in iov_iter already validated. File offset is passed in iocb->ki_pos, iov/nr_segs - in iov_iter. Main concerns in that series are stack footprint and ability to split the damn thing cleanly. [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
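The new method signatures, for reference (as described: the iov_iter arrives fully validated and the file offset is in iocb->ki_pos):

	struct file_operations {
		/* ... */
		ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
		ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
		/* ... */
	};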
|
#
754320d6 |
|
30-Apr-2014 |
Leon Yu <chianglungyu@gmail.com> |
aio: fix potential leak in aio_run_iocb(). The iovec should be reclaimed whenever the caller of rw_copy_check_uvector() returns, but that doesn't hold when a failure happens right after aio_setup_vectored_rw(). Fix that in such a way as to avoid a hairy goto. Signed-off-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
|
#
fa88b6f8 |
|
28-Apr-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: cleanup: flatten kill_ioctx() There is no need to have most of the code in kill_ioctx() indented. Flatten it. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
fb2d4483 |
|
28-Apr-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: report error from io_destroy() when threads race in io_destroy() As reported by Anatol Pomozov, io_destroy() fails to report an error when it loses the race to destroy a given ioctx. Since there is a difference in behaviour between the thread that wins the race (which blocks on outstanding io requests) versus the thread that loses (which returns immediately), wire up a return code from kill_ioctx() to the io_destroy() syscall. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
|
#
d52a8f9e |
|
21-Apr-2014 |
Fabian Frederick <fabf@skynet.be> |
fs/aio.c: Remove ctx parameter in kiocb_cancel ctx is no longer used in kiocb_cancel since 57282d8fd74407 ("aio: Kill ki_users") Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
e02ba72a |
|
15-Apr-2014 |
Anatol Pomozov <anatol.pomozov@gmail.com> |
aio: block io_destroy() until all context requests are completed io_destroy() deletes the aio context and all resources related to it. It makes sense that no IO operations connected to the context should be running after the context is destroyed. As we removed the io_context we have no chance to get request statuses or call io_getevents(). The man page for io_destroy says that this function may block until all the context's requests are completed. Before kernel 3.11 io_destroy() blocked indeed, but since the aio refactoring in 3.11 it is not true anymore. Here is a pseudo-code that shows a testcase for a race condition discovered in 3.11: initialize io_context io_submit(read to buffer) io_destroy() // context is destroyed so we can free the resources free(buffers); // if the buffer is allocated by some other user he'll be surprised // to learn that the buffer is still being filled by an outstanding operation // from the destroyed io_context The fix is straight-forward - add a completion struct and wait on it in io_destroy(); complete() should be called when the number of in-flight requests reaches zero. If two or more io_destroy() calls are made for the same context simultaneously then only the first one waits for IO completion; the other calls' behaviour is undefined. Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and did not see the race condition anymore. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
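In outline, the mechanism reads like this sketch (names invented for the example; the real plumbing threads the completion through kill_ioctx()):

	/* io_destroy() side */
	struct completion requests_done;

	init_completion(&requests_done);
	kill_ioctx(current->mm, ioctx, &requests_done);	/* hand the completion down */
	wait_for_completion(&requests_done);	/* block until the last request finishes */

	/* completion side, once the context's in-flight request count hits zero */
	if (ctx->requests_done)
		complete(ctx->requests_done);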
|
#
fa8a53c3 |
|
28-Mar-2014 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by holding the ring_lock mutex during kioctx setup and page migration. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Based on feedback from Linus, avoid the extra taking of ctx->completion_lock. Instead, make page migration fully serialised by mapping->private_lock, and have aio_free_ring() simply disconnect the kioctx from the mapping by calling put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the error handling logic in aio_migratepage(), and should improve robustness. v4: always do mutex_unlock() in cases when kioctx setup fails. Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org
|
#
3dc9acb6 |
|
19-Dec-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
aio: clean up and fix aio_setup_ring page mapping Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages migration") the aio ring setup code has used a special per-ring backing inode for the page allocations, rather than just using random anonymous pages. However, rather than remembering the pages as it allocated them, it would allocate the pages, insert them into the file mapping (dirty, so that they couldn't be free'd), and then forget about them. And then to look them up again, it would mmap the mapping, and then use "get_user_pages()" to get back an array of the pages we just created. Now, not only is that incredibly inefficient, it also leaked all the pages if the mmap failed (which could happen due to excessive number of mappings, for example). So clean it all up, making it much more straightforward. Also remove some left-overs of the previous (broken) mm_populate() usage that was removed in commit d6c355c7dabc ("aio: fix race in ring buffer page lookup introduced by page migration support") but left the pointless and now misleading MAP_POPULATE flag around. Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e321fef |
|
21-Dec-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio/migratepages: make aio migrate pages sane The arbitrary restriction on page counts offered by the core migrate_page_move_mapping() code results in rather suspicious looking fiddling with page reference counts in the aio_migratepage() operation. To fix this, make migrate_page_move_mapping() take an extra_count parameter that allows aio to tell the code about its own reference count on the page being migrated. While cleaning up aio_migratepage(), make it validate that the old page being passed in is actually what aio_migratepage() expects to prevent misbehaviour in the case of races. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
1881686f |
|
21-Dec-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix kioctx leak introduced by "aio: Fix a trinity splat" e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference counting to correct a bug trinity found. Unfortunately, the change lead to kioctxes being leaked because there was no final reference count to put. Add that reference count back in to fix things. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
|
#
d1b94327 |
|
04-Dec-2013 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
aio: clean up aio ring in the fail path Clean up the aio ring file in the fail path of aio_setup_ring and ioctx_alloc. And maybe it can fix the GPF issue reported by Dave Jones: https://lkml.org/lkml/2013/11/25/898 Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
ddb8c45b |
|
19-Nov-2013 |
Sasha Levin <sasha.levin@oracle.com> |
aio: nullify aio->ring_pages after freeing it After freeing ring_pages we leave it as is, causing a dangling pointer. This has already caused an issue, so to help catch any issues in the future, NULL it out. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
d5580232 |
|
19-Nov-2013 |
Sasha Levin <sasha.levin@oracle.com> |
aio: prevent double free in ioctx_alloc ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring() fails to do so it would call aio_free_ring() before returning, but ioctx_alloc() would call aio_free_ring() again causing a double free of the ring. This is easily reproducible from userspace. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
7f62656b |
|
13-Nov-2013 |
Dan Carpenter <dan.carpenter@oracle.com> |
aio: checking for NULL instead of IS_ERR alloc_anon_inode() returns an ERR_PTR(), it doesn't return NULL. Fixes: 71ad7490c1f3 ('rework aio migrate pages to use aio fs') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
71ad7490 |
|
17-Sep-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
rework aio migrate pages to use aio fs Don't abuse anon_inodes.c to host private files needed by aio; we can bloody well declare a mini-fs of our own instead of patching up what anon_inodes can create for us. Tested-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e34ecee2 |
|
10-Oct-2013 |
Kent Overstreet <kmo@daterainc.com> |
aio: Fix a trinity splat aio kiocb refcounting was broken - it was relying on keeping track of the number of available ring buffer entries, which it needs to do anyways; then at shutdown time it'd wait for completions to be delivered until the # of available ring buffer entries equalled what it was initialized to. Problem with that is that the ring buffer is mapped writable into userspace, so userspace could futz with the head and tail pointers to cause the kernel to see extra completions, and cause free_ioctx() to return while there were still outstanding kiocbs. Which would be bad. Fix is just to directly refcount the kiocbs - which is more straightforward, and with the new percpu refcounting code doesn't cost us any cacheline bouncing which was the whole point of the original scheme. Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't subtracting from aio_nr if ioctx_add_table() failed. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
#
5e9ae2e5 |
|
26-Sep-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix use-after-free in aio_migratepage Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a use-after-free during teardown of the aio ring buffer's mapping. This turns out to be caused by access to the ioctx's ring_pages via the migratepage operation which was not being protected by any locks during ioctx freeing. Use the address_space's private_lock to protect use and updates of the mapping's private_data, and make ioctx teardown unlink the ioctx from the address space. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
d9b2c871 |
|
08-Sep-2013 |
Artem Savkov <artem.savkov@gmail.com> |
aio: rcu_read_lock protection for new rcu_dereference calls Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch" (77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a couple of new rcu_dereference calls which are not protected by rcu_read_lock and result in following warnings during syscall fuzzing(trinity): [ 471.646379] =============================== [ 471.649727] [ INFO: suspicious RCU usage. ] [ 471.653919] 3.11.0-next-20130906+ #496 Not tainted [ 471.657792] ------------------------------- [ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage! [ 471.665968] [ 471.665968] other info that might help us debug this: [ 471.665968] [ 471.672141] [ 471.672141] rcu_scheduler_active = 1, debug_locks = 1 [ 471.677549] 1 lock held by trinity-child0/3774: [ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70 [ 471.688721] [ 471.688721] stack backtrace: [ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496 [ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824 [ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac [ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4 [ 471.722902] Call Trace: [ 471.724859] [<c18a814b>] dump_stack+0x4b/0x66 [ 471.728772] [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100 [ 471.733716] [<c119bc7c>] SyS_io_setup+0x89c/0xc70 [ 471.737806] [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70 [ 471.741689] [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0 [ 471.746080] [<c18b1fcc>] syscall_call+0x7/0xb [ 471.749723] [<c1080000>] ? task_fork_fair+0x240/0x260 Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
d6c355c7 |
|
09-Sep-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix race in ring buffer page lookup introduced by page migration support Prior to the introduction of page migration support in "fs/aio: Add support to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3, mapping of the ring buffer pages was done via get_user_pages() while retaining mmap_sem held for write. This avoided possible races with userland racing an munmap() or mremap(). The page migration patch, however, switched to using mm_populate() to prime the page mapping. mm_populate() cannot be called with mmap_sem held. Instead of dropping the mmap_sem, revert to the old behaviour and simply drop the use of mm_populate() since get_user_pages() will cause the pages to get mapped anyways. Thanks to Al Viro for spotting this issue. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
77d30b14 |
|
30-Aug-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix rcu sparse warnings introduced by ioctx table lookup patch Several sparse warnings were caused by missing rcu_dereference() annotations for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs as the deref was protected by a spin lock in all instances. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
|
#
79bd1bcf |
|
30-Aug-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: remove unnecessary debugging from aio_free_ring() The commit 36bc08cc0170 ("fs/aio: Add support to aio ring pages migration") added some debugging code that is not required and resulted in a build error when 98474236f72e ("vfs: make the dentry cache use the lockref infrastructure") was added to the tree. The code is not required, so just delete it. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
f30d704f |
|
07-Aug-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: table lookup: verify ctx pointer Another shortcoming of the table lookup patch was revealed where the pointer was not being tested before being dereferenced. Verify this to avoid the NULL pointer dereference. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
da90382c |
|
05-Aug-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" In the patch "aio: convert the ioctx list to table lookup v3", incorrect handling in the ioctx_alloc() error path was introduced that led to an ioctx being added via ioctx_add_table() but then freed when the ioctx_alloc() call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by only calling ioctx_add_table() as the last step in ioctx_alloc(). Also, several unnecessary rcu_dereference() calls were added that led to RCU warnings where the system was already protected by a spin lock for accessing mm->ioctx_table. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
6878ea72 |
|
31-Jul-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: be defensive to ensure request batching is non-zero instead of BUG_ON() In the event that an overflow/underflow occurs while calculating req_batch, clamp the minimum at 1 request instead of doing a BUG_ON(). Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
db446a08 |
|
29-Jul-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: convert the ioctx list to table lookup v3 On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote: > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote: > > When using a large number of threads performing AIO operations the > > IOCTX list may get a significant number of entries which will cause > > significant overhead. For example, when running this fio script: > > > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=512; thread; loops=100 > > > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to > > 30% CPU time spent by lookup_ioctx: > > > > 32.51% [guest.kernel] [g] lookup_ioctx > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28 > > 4.40% [guest.kernel] [g] lock_release > > 4.19% [guest.kernel] [g] sched_clock_local > > 3.86% [guest.kernel] [g] local_clock > > 3.68% [guest.kernel] [g] native_sched_clock > > 3.08% [guest.kernel] [g] sched_clock_cpu > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11 > > 2.60% [guest.kernel] [g] memcpy > > 2.33% [guest.kernel] [g] lock_acquired > > 2.25% [guest.kernel] [g] lock_acquire > > 1.84% [guest.kernel] [g] do_io_submit > > > > This patchs converts the ioctx list to a radix tree. For a performance > > comparison the above FIO script was run on a 2 sockets 8 core > > machine. This are the results (average and %rsd of 10 runs) for the > > original list based implementation and for the radix tree based > > implementation: > > > > cores 1 2 4 8 16 32 > > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms > > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43% > > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms > > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75% > > % of radix > > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66% > > to list > > > > To consider the impact of the patch on the typical case of having > > only one ctx per process the following FIO script was run: > > > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=1; thread; loops=100 > > > > on the same system and the results are the following: > > > > list 58892 ms > > %rsd 0.91% > > radix 59404 ms > > %rsd 0.81% > > % of radix > > relative 100.87% > > to list > > So, I was just doing some benchmarking/profiling to get ready to send > out the aio patches I've got for 3.11 - and it looks like your patch is > causing a ~1.5% throughput regression in my testing :/ ... <snip> I've got an alternate approach for fixing this wart in lookup_ioctx()... Instead of using an rbtree, just use the reserved id in the ring buffer header to index an array pointing the ioctx. It's not finished yet, and it needs to be tidied up, but is most of the way there. -ben -- "Thought is the essence of where you are now." -- kmo> And, a rework of Ben's code, but this was entirely his idea kmo> -Kent bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually free memory. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
4cd81c3d |
|
29-Jul-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: double aio_max_nr in calculations With the changes to use percpu counters for aio event ring size calculation, existing increases to aio_max_nr are now insufficient to allow for the allocation of enough events. Double the value used for aio_max_nr to account for the doubling introduced by the percpu slack. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
d29c445b |
|
18-Mar-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: Kill ki_dtor sock_aio_dtor() is dead code - and stuff that does need to do cleanup can simply do it before calling aio_complete(). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
57282d8f |
|
13-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: Kill ki_users The kiocb refcount is only needed for cancellation - to ensure a kiocb isn't freed while a ki_cancel callback is running. But if we restrict ki_cancel callbacks to not block (which they currently don't), we can simply drop the refcount. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
8bc92afc |
|
25-Feb-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: Kill unneeded kiocb members The old aio retry infrastructure needed to save the various arguments to aio operations. But with the retry infrastructure gone, we can trim struct kiocb quite a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
73a7075e |
|
09-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: Kill aio_rw_vect_retry() This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
5ffac122 |
|
09-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: Don't use ctx->tail unnecessarily aio_complete() (arguably) needs to keep its own trusted copy of the tail pointer, but io_getevents() doesn't have to use it - it's already using the head pointer from the ring buffer. So convert it to use the tail from the ring buffer so it touches fewer cachelines and doesn't contend with the cacheline aio_complete() needs. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
bec68faa |
|
13-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: io_cancel() no longer returns the io_event Originally, io_event() was documented to return the io_event if cancellation succeeded - the io_event wouldn't be delivered via the ring buffer like it normally would. But this isn't what the implementation was actually doing; the only driver implementing cancellation, the usb gadget code, never returned an io_event in its cancel function. And aio_complete() was recently changed to no longer suppress event delivery if the kiocb had been cancelled. This gets rid of the unused io_event argument to kiocb_cancel() and kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if kiocb->ki_cancel() returned success. Also tweak the refcounting in kiocb_cancel() to make more sense. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
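A hedged userspace illustration of the new contract (libaio; assumes a backend that actually implements cancellation, such as the usb gadget code, and that the wrapper forwards the kernel's return value; error handling trimmed):

#include <libaio.h>
#include <errno.h>
#include <stdio.h>

static void cancel_one(io_context_t ctx, struct iocb *iocb)
{
	struct io_event ev;
	int ret = io_cancel(ctx, iocb, &ev);

	if (ret == -EINPROGRESS) {
		/* cancellation was started; 'ev' is no longer filled in here,
		 * the completion arrives through the ring like any other event */
		io_getevents(ctx, 1, 1, &ev, NULL);
		printf("cancelled iocb finished, res=%lld\n", (long long)ev.res);
	}
}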
|
#
723be6e3 |
|
28-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: percpu ioctx refcount This just converts the ioctx refcount to the new generic dynamic percpu refcount code. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
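A minimal sketch of the conversion, assuming the generic percpu_ref API (the init/kill signatures have varied across kernel versions, and free_ioctx_ref is an illustrative name):

struct kioctx {
	struct percpu_ref	users;		/* was an atomic refcount */
	/* ... */
};

static void free_ioctx_ref(struct percpu_ref *ref)
{
	struct kioctx *ctx = container_of(ref, struct kioctx, users);
	/* last reference is gone: kick off the real teardown of ctx */
}

/*
 * setup:     percpu_ref_init(&ctx->users, free_ioctx_ref, ...);
 * fast path: percpu_ref_get(&ctx->users) / percpu_ref_put(&ctx->users)
 *            touch only per-cpu counters, no shared cacheline.
 * shutdown:  percpu_ref_kill(&ctx->users) collapses to atomic mode and
 *            calls free_ioctx_ref() once the count drains to zero.
 */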
|
#
e1bdd5f2 |
|
25-Apr-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: percpu reqs_available See the previous patch ("aio: reqs_active -> reqs_available") for why we want to do this - this basically implements a per cpu allocator for reqs_available that doesn't actually allocate anything. Note that we need to increase the size of the ringbuffer we allocate, since a single thread won't necessarily be able to use all the reqs_available slots - some (up to about half) might be on other per cpu lists, unavailable for the current thread. We size the ringbuffer based on the nr_events userspace passed to io_setup(), so this is a slight behaviour change - but nr_events wasn't being used as a hard limit before, it was being rounded up to the next page before so this doesn't change the actual semantics. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
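A minimal sketch of the per-cpu slot allocator described above (KIOCB_BATCH, get_req_slot() and the field names are illustrative, not the upstream implementation):

#define KIOCB_BATCH	32		/* illustrative refill batch size */

struct kioctx_cpu {
	unsigned	reqs_available;	/* slots cached on this CPU */
};

/* the kioctx is assumed to carry:
 *	struct kioctx_cpu __percpu *cpu;
 *	atomic_t reqs_available;	// shared pool of remaining slots
 */
static bool get_req_slot(struct kioctx *ctx)
{
	struct kioctx_cpu *kcpu = get_cpu_ptr(ctx->cpu);
	bool ret = false;

	while (!kcpu->reqs_available) {
		int old = atomic_read(&ctx->reqs_available);
		int take = min(old, KIOCB_BATCH);

		if (take <= 0)
			break;		/* the ring is genuinely full */
		/* move a batch from the shared pool to this CPU's cache */
		if (atomic_cmpxchg(&ctx->reqs_available, old, old - take) == old)
			kcpu->reqs_available += take;
	}
	if (kcpu->reqs_available) {
		kcpu->reqs_available--;	/* allocation = taking one cached slot */
		ret = true;
	}
	put_cpu_ptr(ctx->cpu);
	return ret;
}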
|
#
34e83fc6 |
|
25-Apr-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: reqs_active -> reqs_available The number of outstanding kiocbs is one of the few shared things left that has to be touched for every kiocb - it'd be nice to make it percpu. We can make it per cpu by treating it like an allocation problem: we have a maximum number of kiocbs that can be outstanding (i.e. slots) - then we just allocate and free slots, and we know how to write per cpu allocators. So as prep work for that, we convert reqs_active to reqs_available. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
|
#
0c45355f |
|
17-Jul-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix build when migration is disabled When "fs/aio: Add support to aio ring pages migration" was applied, it broke the build when CONFIG_MIGRATION was disabled. Wrap the migration code with a test for CONFIG_MIGRATION to fix this and save a few bytes when migration is disabled. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
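A minimal sketch of the shape of the fix, assuming the migration callback introduced by "fs/aio: Add support to aio ring pages migration" (names are illustrative) is simply compiled out when CONFIG_MIGRATION=n:

#if IS_ENABLED(CONFIG_MIGRATION)
static int aio_migratepage(struct address_space *mapping,
			   struct page *new, struct page *old,
			   enum migrate_mode mode)
{
	/* relocate the ring page and update the kioctx's page array */
	return MIGRATEPAGE_SUCCESS;
}
#endif

static const struct address_space_operations aio_ring_aops = {
#if IS_ENABLED(CONFIG_MIGRATION)
	.migratepage	= aio_migratepage,	/* absent, and code saved, otherwise */
#endif
};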
|
#
36bc08cc |
|
16-Jul-2013 |
Gu Zheng <guz.fnst@cn.fujitsu.com> |
fs/aio: Add support to aio ring pages migration Because aio jobs pin the ring pages, memory migration of those pages fails. To fix this, use an anon inode to manage the aio ring pages and set up the migratepage callback in the anon inode's address space, so that during memory migration the aio ring pages can be moved to another memory node safely. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
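A minimal sketch of the setup side of this approach (aio_ring_fops, alloc_ring_file and the aops wiring are illustrative assumptions; aio_migratepage is as sketched in the build-fix entry above):

static const struct address_space_operations aio_ring_aops = {
	.migratepage	= aio_migratepage,
};

static struct file *alloc_ring_file(struct kioctx *ctx)
{
	/* an anon inode keeps the ring pages off any real filesystem,
	 * but still gives them an address_space we can hook */
	struct file *file = anon_inode_getfile("[aio]", &aio_ring_fops,
					       ctx, O_RDWR);

	if (!IS_ERR(file))
		file_inode(file)->i_mapping->a_ops = &aio_ring_aops;
	return file;
}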
|
#
4b30f07e |
|
03-Jul-2013 |
Tang Chen <tangchen@cn.fujitsu.com> |
aio: fix wrong comment in aio_complete() ctx->ctx_lock should be ctx->completion_lock. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
68d70d03 |
|
19-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
constify rw_verify_area() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4fcc712f |
|
12-Jun-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: fix io_destroy() regression by using call_rcu() There was a regression introduced by 36f5588905c1 ("aio: refcounting cleanup"), reported by Jens Axboe - the refcounting cleanup switched to using RCU in the shutdown path, but the synchronize_rcu() was done in the context of the io_destroy() syscall greatly increasing the time it could block. This patch switches it to call_rcu() and makes shutdown asynchronous (more asynchronous than it was originally; before the refcount changes io_destroy() would still wait on pending kiocbs). Note that there's a global quota on the max outstanding kiocbs, and that quota must be manipulated synchronously; otherwise io_setup() could return -EAGAIN when there isn't quota available, and userspace won't have any way of waiting until shutdown of the old kioctxs has finished (besides busy looping). So we release our quota before kioctx shutdown has finished, which should be fine since the quota never corresponded to anything real anyways. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Tested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
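A minimal sketch of the shape of the fix (helper names are illustrative); the point is that the syscall path only unlinks the context and queues the free, it never blocks in synchronize_rcu():

static void free_ioctx_rcu(struct rcu_head *head)
{
	struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);

	kmem_cache_free(kioctx_cachep, ctx);	/* runs after a grace period */
}

static void shutdown_ioctx_sketch(struct kioctx *ctx)
{
	hlist_del_rcu(&ctx->list);	/* unpublish: no new lookups succeed */
	put_aio_nr_quota(ctx);		/* illustrative: release the aio_nr quota
					 * synchronously, so io_setup() never
					 * spuriously returns -EAGAIN */
	call_rcu(&ctx->rcu_head, free_ioctx_rcu);	/* asynchronous, non-blocking */
}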
|
#
03e04f04 |
|
24-May-2013 |
Benjamin LaHaise <bcrl@kvack.org> |
aio: fix kioctx not being freed after cancellation at exit time The recent changes overhauling fs/aio.c introduced a bug that results in the kioctx not being freed when outstanding kiocbs are cancelled at exit_aio() time. Specifically, a kiocb that is cancelled has its completion events discarded by batch_complete_aio(), which then fails to wake up the process stuck in free_ioctx(). Fix this by modifying the wait_event() condition in free_ioctx() appropriately. This patch was tested with the cancel operation in the thread based code posted yesterday. [akpm@linux-foundation.org: fix build] Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6900807c |
|
24-May-2013 |
Jeff Moyer <jmoyer@redhat.com> |
aio: fix io_getevents documentation In reviewing man pages, I noticed that io_getevents is documented to update the timeout that gets passed into the library call. This doesn't happen in kernel space or in the library (even though it's documented to do so in both places). Unless there is objection, I'd like to fix the comments/docs to match the code (I will also update the man page upon consensus). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Cyril Hrubis <chrubis@suse.cz> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41ef4eb8 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: kill ki_retry Thanks to Zach Brown's work to rip out the retry infrastructure, we don't need this anymore - ki_retry was only called right after the kiocb was initialized. This also refactors and trims some duplicated code, as well as cleaning up the refcounting/error handling a bit. [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()] [akpm@linux-foundation.org: fix file_start_write/file_end_write tests] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8a660890 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: kill ki_key ki_key wasn't actually used for anything previously - it was always 0. Drop it to trim struct kiocb a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4e23bcae |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: give shared kioctx fields their own cachelines [akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
58c85dc2 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: kill struct aio_ring_info struct aio_ring_info was kind of odd, the only place it's used is where it's embedded in struct kioctx - there's no real need for it. The next patch rearranges struct kioctx and puts various things on their own cachelines - getting rid of struct aio_ring_info now makes that reordering a bit clearer. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a1c8eae7 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: kill batch allocation Previously, allocating a kiocb required touching quite a few global (well, per kioctx) cachelines... so batching up allocation to amortize those was worthwhile. But we've gotten rid of some of those, and in another couple of patches kiocb allocation won't require writing to any shared cachelines, so that means we can just rip this code out. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3e845ce0 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: change reqs_active to include unreaped completions The aio code tries really hard to avoid having to deal with the completion ringbuffer overflowing. To do that, it has to keep track of the number of outstanding kiocbs, and the number of completions currently in the ringbuffer - and it's got to check that every time we allocate a kiocb. Ouch. But - we can improve this quite a bit if we just change reqs_active to mean "number of outstanding requests and unreaped completions" - that means kiocb allocation doesn't have to look at the ringbuffer, which is a fairly significant win. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0460fef2 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: use cancellation list lazily Cancelling kiocbs requires adding them to a per kioctx linked list, which is one of the few things we need to take the kioctx lock for in the fast path. But most kiocbs can't be cancelled - so if we just do this lazily, we can avoid quite a bit of locking overhead. While we're at it, instead of using a flag bit switch to using ki_cancel itself to indicate that a kiocb has been cancelled/completed. This lets us get rid of ki_flags entirely. [akpm@linux-foundation.org: remove buggy BUG()] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
21b40200 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: use flush_dcache_page() This wasn't causing problems before because it's not needed on x86, but it is needed on other architectures. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
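A minimal sketch of the pattern this adds around ring-buffer stores (the helper name is illustrative); flush_dcache_page() is a no-op on x86 but required on architectures with aliasing data caches:

static void write_event_sketch(struct page *ring_page, unsigned idx,
			       const struct io_event *ev)
{
	struct io_event *evs = kmap_atomic(ring_page);

	evs[idx] = *ev;			/* store the completion */
	kunmap_atomic(evs);
	flush_dcache_page(ring_page);	/* make the store visible through the
					 * userspace mapping of the ring */
}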
|
#
a31ad380 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: make aio_read_evt() more efficient, convert to hrtimers Previously, aio_read_event() pulled a single completion off the ringbuffer at a time, locking and unlocking each time. Change it to pull off as many events as it can at a time, and copy them directly to userspace. This also fixes a bug where if copying the event to userspace failed, we'd lose the event. Also convert it to wait_event_interruptible_hrtimeout(), which simplifies it quite a bit. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
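A minimal sketch of the resulting wait loop; wait_event_interruptible_hrtimeout() is the real helper named above, while aio_read_events() stands in for a helper that copies as many completed events as it can straight to userspace and reports the count through *ret:

static long read_events_sketch(struct kioctx *ctx, long min_nr, long nr,
			       struct io_event __user *uevents, ktime_t until)
{
	long ret = 0;

	/* one lock/unlock per batch instead of one per event; if fewer than
	 * min_nr events are available, sleep with a high-resolution timeout */
	wait_event_interruptible_hrtimeout(ctx->wait,
			aio_read_events(ctx, min_nr, nr, uevents, &ret),
			until);
	return ret;
}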
|
#
36f55889 |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: refcounting cleanup The usage of ctx->dead was fubar - it makes no sense to explicitly check it all over the place, especially when we're already using RCU. Now, ctx->dead only indicates whether we've dropped the initial refcount. The new teardown sequence is: set ctx->dead hlist_del_rcu(); synchronize_rcu(); Now we know no system calls can take a new ref, and it's safe to drop the initial ref: put_ioctx(); We also need to ensure there are no more outstanding kiocbs. This was done incorrectly - it was being done in kill_ctx(), and before dropping the initial refcount. At this point, other syscalls may still be submitting kiocbs! Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after kioctx->users has dropped to 0 and we know no more iocbs could be submitted. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
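The teardown ordering described above, restated as a sketch:

	ctx->dead = 1;			/* no new references via lookup_ioctx() */
	hlist_del_rcu(&ctx->list);	/* unpublish from the mm's context list */
	synchronize_rcu();		/* wait out any in-flight RCU lookups */
	put_ioctx(ctx);			/* drop the initial ref; free_ioctx()
					 * later cancels and waits for any
					 * outstanding kiocbs once the
					 * refcount reaches zero */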
|
#
11599eba |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: make aio_put_req() lockless Freeing a kiocb needed to touch the kioctx for three things: * Pull it off the reqs_active list * Decrementing reqs_active * Issuing a wakeup, if the kioctx was in the process of being freed. This patch moves these to aio_complete(), for a couple reasons: * aio_complete() already has to issue the wakeup, so if we drop the kioctx refcount before aio_complete does its wakeup we don't have to do it twice. * aio_complete currently has to take the kioctx lock, so it makes sense for it to pull the kiocb off the reqs_active list too. * A later patch is going to change reqs_active to include unreaped completions - this will mean allocating a kiocb doesn't have to look at the ringbuffer. So taking the decrement of reqs_active out of kiocb_free() is useful prep work for that patch. This doesn't really affect cancellation, since existing (usb) code that implements a cancel function still calls aio_complete() - we just have to make sure that aio_complete does the necessary teardown for cancelled kiocbs. It does affect code paths where we free kiocbs that were never submitted; they need to decrement reqs_active and pull the kiocb off the reqs_active list. This occurs in two places: kiocb_batch_free(), which is going away in a later patch, and the error path in io_submit_one. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1d98ebfc |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: do fget() after aio_get_req() aio_get_req() will fail if we have the maximum number of requests outstanding, which depending on the application may not be uncommon. So avoid doing an unnecessary fget(). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
caf4167a |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: dprintk() -> pr_debug() Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4e179bca |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: move private stuff out of aio.h Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
906b973c |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: add kiocb_cancel() Minor refactoring, to get rid of some duplicated code [akpm@linux-foundation.org: fix warning] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2d68449e |
|
07-May-2013 |
Kent Overstreet <koverstreet@google.com> |
aio: kill return value of aio_complete() Nothing used the return value, and it probably wasn't possible to use it safely for the locked versions (aio_complete(), aio_put_req()). Just kill it. Signed-off-by: Kent Overstreet <koverstreet@google.com> Acked-by: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41003a7b |
|
07-May-2013 |
Zach Brown <zab@redhat.com> |
aio: remove retry-based AIO This removes the retry-based AIO infrastructure now that nothing in tree is using it. We want to remove retry-based AIO because it is fundamentally unsafe. It retries IO submission from a kernel thread that has only assumed the mm of the submitting task. All other task_struct references in the IO submission path will see the kernel thread, not the submitting task. This design flaw means that nothing of any meaningful complexity can use retry-based AIO. This removes all the code and data associated with the retry machinery. The most significant benefit of this is the removal of the locking around the unused run list in the submission path. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Zach Brown <zab@redhat.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
91d80a84 |
|
25-Apr-2013 |
Zhao Hongjiang <zhaohongjiang@huawei.com> |
aio: fix possible invalid memory access when DEBUG is enabled dprintk() shouldn't access @ring after it's unmapped. Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8d71db4f |
|
19-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift sb_start_write/sb_end_write out of ->aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2cf09666 |
|
21-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect ... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded uses of asmlinkage_protect() in a bunch of syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b67bfe0d |
|
27-Feb-2013 |
Sasha Levin <sasha.levin@oracle.com> |
hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
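Before/after, using the aio context list of this era as the example (field names as in fs/aio.c at the time):

	struct kioctx *ctx;
	struct hlist_node *pos;		/* old iterators needed this cursor */

	/* before the change */
	hlist_for_each_entry(ctx, pos, &mm->ioctx_list, list)
		pr_debug("ctx %p\n", ctx);

	/* after the change: the cursor parameter is gone, so the call
	 * finally looks like list_for_each_entry() */
	hlist_for_each_entry(ctx, &mm->ioctx_list, list)
		pr_debug("ctx %p\n", ctx);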
|
#
41badc15 |
|
22-Feb-2013 |
Michel Lespinasse <walken@google.com> |
mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple, however there was no equivalent code in mm_populate(), which caused issues. This could be fixed by introduced the same rounding in mm_populate(), however I think it's preferable to make do_mmap_pgoff() return populate as a size rather than as a boolean, so we don't have to duplicate the size rounding logic in mm_populate(). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bebeb3d6 |
|
22-Feb-2013 |
Michel Lespinasse <walken@google.com> |
mm: introduce mm_populate() for populating new vmas When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was using during mlock() / mlockall(). Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3ffa3c0e |
|
24-Jun-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: now fput() is OK from interrupt context; get rid of manual delayed __fput() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e3fc629d |
|
30-May-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch aio and shm to do_mmap_pgoff(), make do_mmap() static after all, 0 bytes and 0 pages is the same thing... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ac34ebb3 |
|
31-May-2012 |
Christopher Yeoh <cyeoh@au1.ibm.com> |
aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after changes made to support CMA in an earlier patch. Rather than having an additional check_access parameter to these functions, the first paramater type is overloaded to allow the caller to specify CHECK_IOVEC_ONLY which means check that the contents of the iovec are valid, but do not check the memory that they point to. This is used by process_vm_readv/writev where we need to validate that a iovec passed to the syscall is valid but do not want to check the memory that it points to at this point because it refers to an address space in another process. Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a70b52ec |
|
21-May-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: make AIO use the proper rw_verify_area() area helpers We had for some reason overlooked the AIO interface, and it didn't use the proper rw_verify_area() helper function that checks (for example) mandatory locking on the file, and that the size of the access doesn't cause us to overflow the provided offset limits etc. Instead, AIO did just the security_file_permission() thing (that rw_verify_area() also does) directly. This fixes it to do all the proper helper functions, which not only means that now mandatory file locking works with AIO too, we can actually remove lines of code. Reported-by: Manish Honap <manish_honap_vit@yahoo.co.in> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bfce281c |
|
20-Apr-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
kill mm argument of vm_munmap() it's always current->mm Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
936af157 |
|
20-Apr-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: don't bother with unmapping when aio_free_ring() is coming from exit_aio() ... since exit_mmap() is coming and it will munmap() everything anyway. In all other cases aio_free_ring() has ctx->mm == current->mm; moreover, all other callers of vm_munmap() have mm == current->mm, so this will allow us to get rid of mm argument of vm_munmap(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a46ef99d |
|
20-Apr-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
VM: add "vm_munmap()" helper function Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a2e1859a |
|
20-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: take final put_ioctx() into callers of io_destroy() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
06af121e |
|
20-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: merge aio_cancel_all() with wait_for_all_aios() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9fcf03d0 |
|
13-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: fix the comment in aio_kick_handler() It should've been changed when queue_work() became queue_delayed_work(..., 0) in there. It had always been about not needing a delay, not about not using a specific function... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cd1ea261 |
|
10-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: don't bother with cancel_delayed_work() in exit_aio() __put_ioctx() will cover it anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
bf50722a |
|
10-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: use cancel_delayed_work_sync() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9fa1cb39 |
|
10-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: aio_nr_lock is taken only synchronously now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2dd542b7 |
|
10-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: aio_nr decrements don't need to be delayed we can do that right in __put_ioctx(); as a result, the loop in ioctx_alloc() can be killed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e23754f8 |
|
06-Mar-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
aio: don't bother with async freeing on failure in ioctx_alloc() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e8e3c3d6 |
|
25-Nov-2011 |
Cong Wang <amwang@redhat.com> |
fs: remove the second argument of k[un]map_atomic() Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Cong Wang <amwang@redhat.com>
|
#
c7b28555 |
|
08-Mar-2012 |
Al Viro <viro@ZenIV.linux.org.uk> |
aio: fix the "too late munmap()" race Current code has put_ioctx() called asynchronously from aio_fput_routine(); that's done *after* we have killed the request that used to pin ioctx, so there's nothing to stop io_destroy() waiting in wait_for_all_aios() from progressing. As the result, we can end up with async call of put_ioctx() being the last one and possibly happening during exit_mmap() or elf_core_dump(), neither of which expects stray munmap() being done to them... We do need to prevent _freeing_ ioctx until aio_fput_routine() is done with that, but that's all we care about - neither io_destroy() nor exit_aio() will progress past wait_for_all_aios() until aio_fput_routine() does really_put_req(), so the ioctx teardown won't be done until then and we don't care about the contents of ioctx past that point. Since actual freeing of these suckers is RCU-delayed, we don't need to bump ioctx refcount when request goes into list for async removal. All we need is rcu_read_lock held just over the ->ctx_lock-protected area in aio_fput_routine(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
86b62a2c |
|
06-Mar-2012 |
Al Viro <viro@ZenIV.linux.org.uk> |
aio: fix io_setup/io_destroy race Have ioctx_alloc() return an extra reference, so that caller would drop it on success and not bother with re-grabbing it on failure exit. The current code is obviously broken - io_destroy() from another thread that managed to guess the address io_setup() would've returned would free ioctx right under us; gets especially interesting if aio_context_t * we pass to io_setup() points to PROT_READ mapping, so put_user() fails and we end up doing io_destroy() on kioctx another thread has just got freed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
880641bb |
|
05-Mar-2012 |
Jeff Moyer <jmoyer@redhat.com> |
aio: wake up waiters when freeing unused kiocbs Bart Van Assche reported a hung fio process when either hot-removing storage or when interrupting the fio process itself. The (pruned) call trace for the latter looks like so: fio D 0000000000000001 0 6849 6848 0x00000004 ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0 ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8 ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d Call Trace: schedule+0x3f/0x60 io_schedule+0x8f/0xd0 wait_for_all_aios+0xc0/0x100 exit_aio+0x55/0xc0 mmput+0x2d/0x110 exit_mm+0x10d/0x130 do_exit+0x671/0x860 do_group_exit+0x44/0xb0 get_signal_to_deliver+0x218/0x5a0 do_signal+0x65/0x700 do_notify_resume+0x65/0x80 int_signal+0x12/0x17 The problem lies with the allocation batching code. It will opportunistically allocate kiocbs, and then trim back the list of iocbs when there is not enough room in the completion ring to hold all of the events. In the case above, what happens is that the pruning back of events ends up freeing up the last active request and the context is marked as dead, so it is thus responsible for waking up waiters. Unfortunately, the code does not check for this condition, so we end up with a hung task. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Cc: <stable@kernel.org> [3.2.x only] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
630d9c47 |
|
16-Nov-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
fs: reduce the use of module.h wherever possible For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
69e4747e |
|
08-Jan-2012 |
Gleb Natapov <gleb@redhat.com> |
Unused iocbs in a batch should not be accounted as active. Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are allocated in a batch during processing of the first iocbs. All iocbs in a batch are automatically added to the ctx->active_reqs list and accounted in ctx->reqs_active. If one (not the last one) of the iocbs submitted by a user fails, further iocbs are not processed, but they are still present in ctx->active_reqs and accounted in ctx->reqs_active. This causes the process to get stuck in a D state in wait_for_all_aios() on exit, since ctx->reqs_active will never go down to zero. Furthermore, since kiocb_batch_free() frees an iocb without removing it from the active_reqs list, the list becomes corrupted, which may cause an oops. Fix this by removing the iocb from ctx->active_reqs and updating ctx->reqs_active in kiocb_batch_free(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: stable@kernel.org # 3.2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
080d676d |
|
02-Nov-2011 |
Jeff Moyer <jmoyer@redhat.com> |
aio: allocate kiocbs in batches In testing aio on a fast storage device, I found that the context lock takes up a fair amount of cpu time in the I/O submission path. The reason is that we take it for every I/O submitted (see __aio_get_req). Since we know how many I/Os are passed to io_submit, we can preallocate the kiocbs in batches, reducing the number of times we take and release the lock. In my testing, I was able to reduce the amount of time spent in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used to test this was: aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev> I also tested the patch with various numbers of events passed to io_submit, and I ran the xfstests aio group of tests to ensure I didn't break anything. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Daniel Ehrenberg <dehrenberg@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fcf63409 |
|
31-Oct-2011 |
Christopher Yeoh <cyeoh@au1.ibm.com> |
Cross Memory Attach The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. 
Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc; other architectures should mainly (I think) just need to add syscall numbers for process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
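A hedged userspace illustration of the interface (glibc wrapper; error handling trimmed):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>

static ssize_t read_remote(pid_t pid, void *remote_addr, void *buf, size_t len)
{
	struct iovec local  = { .iov_base = buf,         .iov_len = len };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

	/* single copy, straight from the other process's address space */
	ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
	if (n < 0)
		perror("process_vm_readv");
	return n;
}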
|
#
e91f90bb |
|
22-Mar-2011 |
Roland Dreier <roland@purestorage.com> |
aio: wake all waiters when destroying ctx The test program below will hang because io_getevents() uses add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only wakes up one of the threads. Fix this by using wake_up_all() in the aio code paths where we want to make sure no one gets stuck. // t.c -- compile with gcc -lpthread -laio t.c #include <libaio.h> #include <pthread.h> #include <stdio.h> #include <unistd.h> static const int nthr = 2; void *getev(void *ctx) { struct io_event ev; io_getevents(ctx, 1, 1, &ev, NULL); printf("io_getevents returned\n"); return NULL; } int main(int argc, char *argv[]) { io_context_t ctx = 0; pthread_t thread[nthr]; int i; io_setup(1024, &ctx); for (i = 0; i < nthr; ++i) pthread_create(&thread[i], NULL, getev, ctx); sleep(1); io_destroy(ctx); for (i = 0; i < nthr; ++i) pthread_join(thread[i], NULL); return 0; } Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cf15900e |
|
02-Mar-2011 |
Jens Axboe <jaxboe@fusionio.com> |
aio: remove request submission batching This should be useless now that we have on-stack plugging. So lets just kill it. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
9f5b9425 |
|
30-Jun-2010 |
Shaohua Li <shaohua.li@intel.com> |
fs: make aio plug Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
7eaceacc |
|
10-Mar-2011 |
Jens Axboe <jaxboe@fusionio.com> |
block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So let's kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
7137c6bd |
|
25-Feb-2011 |
Jan Kara <jack@suse.cz> |
aio: fix race between io_destroy() and io_submit() A race can occur when io_submit() races with io_destroy():

    CPU1                                    CPU2
    io_submit()
      do_io_submit()
        ...
        ctx = lookup_ioctx(ctx_id);
                                            io_destroy()
        Now do_io_submit() holds the last reference to ctx.
        ...
        queue new AIO
        put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed in AIO submission path after adding new AIO to ctx. Then we are guaranteed that either io_destroy() waits for new AIO or we see that ctx is being destroyed and bail out. Cc: Nick Piggin <npiggin@kernel.dk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3bd9a5d7 |
|
25-Feb-2011 |
Nick Piggin <npiggin@gmail.com> |
aio: fix rcu ioctx lookup aio-dio-invalidate-failure GPFs in aio_put_req from io_submit. lookup_ioctx doesn't implement the rcu lookup pattern properly. rcu_read_lock does not prevent refcount going to zero, so we might take a refcount on a zero count ioctx. Fix the bug by atomically testing for zero refcount before incrementing. [jack@suse.cz: added comment into the code] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
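The pattern being fixed here is easy to get wrong; as a rough illustration, this is the userspace C11 equivalent of "take a reference only if the count is still non-zero" (a sketch of the idea, not the kernel code):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Under RCU a reader may still observe an object whose refcount has
     * already hit zero; a plain increment would resurrect a dying object.
     * Only take the reference if the count is provably non-zero. */
    static bool get_ref_if_live(atomic_int *refcount)
    {
            int old = atomic_load(refcount);

            while (old != 0) {
                    /* retry when another thread changed the count under us */
                    if (atomic_compare_exchange_weak(refcount, &old, old + 1))
                            return true;    /* reference taken */
            }
            return false;                   /* object is dying, leave it alone */
    }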
|
#
d37adaa1 |
|
26-Jan-2011 |
Tejun Heo <tj@kernel.org> |
fs/aio: aio_wq isn't used in memory reclaim path aio_wq isn't used during memory reclaim. Convert to alloc_workqueue() without WQ_MEM_RECLAIM. It's possible to use system_wq but given that the number of work items is determined from userland and the work item may block, enforcing a strict concurrency limit would be a good idea. Also, move fput_work to system_wq so that aio_wq is used solely to throttle the max concurrency of aio work items and fput_work doesn't interact with other work items. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: linux-aio@kvack.org
|
#
27eaa1c9 |
|
13-Dec-2010 |
Namhyung Kim <namhyung@gmail.com> |
aio: check return value of create_workqueue() Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d3486f8b |
|
12-Jan-2011 |
Jeff Moyer <jmoyer@redhat.com> |
aio: remove unused aio_run_iocbs() aio_run_iocbs() is not used at all, so get rid of it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2e410255 |
|
12-Jan-2011 |
Namhyung Kim <namhyung@gmail.com> |
aio: remove unnecessary check 'nr >= min_nr >= 0' always satisfies 'nr >= 0' so the check is unnecessary. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7de9c6ee |
|
23-Oct-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: ihold() Clones an existing reference to inode; caller must already hold one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
306fb097 |
|
23-Aug-2010 |
Chris Mason <chris.mason@oracle.com> |
aio: bump i_count instead of using igrab The aio batching code is using igrab to get an extra reference on the inode so it can safely batch. igrab will go ahead and take the global inode spinlock, which can be a bottleneck on large machines doing lots of AIO. In this case, igrab isn't required because we already have a reference on the file handle. It is safe to just bump the i_count directly on the inode. Benchmarking shows this patch brings IOP/s on tons of flash up by about 2.5X. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a0c42bac |
|
22-Sep-2010 |
Jan Kara <jack@suse.cz> |
aio: do not return ERESTARTSYS as a result of AIO OCFS2 can return ERESTARTSYS from its write function when the process is signalled while waiting for a cluster lock (and the filesystem is mounted with intr mount option). Generally, it seems reasonable to allow filesystems to return this error code from its IO functions. As we must not leak ERESTARTSYS (and similar error codes) to userspace as a result of an AIO operation, we have to properly convert it to EINTR inside AIO code (restarting the syscall isn't really an option because other AIO could have been already submitted by the same io_submit syscall). Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
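A sketch of the conversion described above (illustrative kernel-style fragment, not the literal patch):

    /* Restart codes must not reach userspace: the io_submit() syscall
     * cannot simply be restarted once other iocbs from the same call have
     * already been queued, so report EINTR instead. */
    if (res == -ERESTARTSYS || res == -ERESTARTNOINTR ||
        res == -ERESTARTNOHAND || res == -ERESTART_RESTARTBLOCK)
            res = -EINTR;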
|
#
75e1c70f |
|
10-Sep-2010 |
Jeff Moyer <jmoyer@redhat.com> |
aio: check for multiplication overflow in do_io_submit Tavis Ormandy pointed out that do_io_submit does not do proper bounds checking on the passed-in iocb array: if (unlikely(nr < 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp))))) return -EFAULT; ^^^^^^^^^^^^^^^^^^ The attached patch checks for overflow, and if it is detected, the number of iocbs submitted is scaled down to a number that will fit in a long. This is an ok thing to do, as sys_io_submit is documented as returning the number of iocbs submitted, so callers should handle a return value of less than the 'nr' argument passed in. Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
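The clamp itself is small; a sketch of the idea, reusing the variable names from the quoted snippet (illustrative fragment, not the verbatim patch):

    /* nr * sizeof(*iocbpp) must not overflow a long; if it would, cap nr.
     * Callers already have to cope with io_submit() returning fewer
     * submissions than requested. */
    if (nr > LONG_MAX / sizeof(*iocbpp))
            nr = LONG_MAX / sizeof(*iocbpp);

    if (unlikely(!access_ok(VERIFY_READ, iocbpp, nr * sizeof(*iocbpp))))
            return -EFAULT;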
|
#
642b5123 |
|
05-Aug-2010 |
Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> |
aio: fix wrong subsystem comments - sys_io_destroy(): actually return -EINVAL if the context pointed to is invalid - sys_io_getevents(): An argument specifying timeout is not `when', but `timeout'. - sys_io_getevents(): Should describe what is returned if this syscall succeeds. Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d7065da0 |
|
26-May-2010 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of the magic around f_count in aio __aio_put_req() plays sick games with file refcount. What it wants is fput() from atomic context; it's almost always done with f_count > 1, so they only have to deal with delayed work in rare cases when their reference happens to be the last one. Current code decrements f_count and if it hasn't hit 0, everything is fine. Otherwise it keeps a pointer to struct file (with zero f_count!) around and has delayed work do __fput() on it. Better way to do it: use atomic_long_add_unless( , -1, 1) instead of !atomic_long_dec_and_test(). IOW, decrement it only if it's not the last reference, leave refcount alone if it was. And use normal fput() in delayed work. I've made that atomic_long_add_unless call a new helper - fput_atomic(). Drops a reference to file if it's safe to do in atomic (i.e. if that's not the last one), tells if it had been able to do that. aio.c converted to it, __fput() use is gone. req->ki_file *always* contributes to refcount now. And __fput() became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
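A sketch of the helper as it is described above (kernel-style fragment):

    /* Drop a reference only when it is provably not the last one, so the
     * call is safe in atomic context; returns non-zero on success.  The
     * rare last-reference case falls back to a normal fput() from process
     * context (delayed work). */
    static inline int fput_atomic(struct file *file)
    {
            return atomic_long_add_unless(&file->f_count, -1, 1);
    }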
|
#
9d85cba7 |
|
26-May-2010 |
Jeff Moyer <jmoyer@redhat.com> |
aio: fix the compat vectored operations The aio compat code was not converting the struct iovecs from 32bit to 64bit pointers, causing either EINVAL to be returned from io_getevents, or EFAULT as the result of the I/O. This patch passes a compat flag to io_submit to signal that pointer conversion is necessary for a given iocb array. A variant of this was tested by Michael Tokarev. I have also updated the libaio test harness to exercise this code path with good success. Further, I grabbed a copy of ltp and ran the testcases/kernel/syscall/readv and writev tests there (compiled with -m32 on my 64bit system). All seems happy, but extra eyes on this would be welcome. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> [2.6.35.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fac046ad |
|
15-Dec-2009 |
Shaohua Li <shaohua.li@intel.com> |
aio: remove unused field Don't know the reason, but it appears ki_wait field of iocb never gets used. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b9d128f1 |
|
29-Oct-2009 |
Jens Axboe <jens.axboe@oracle.com> |
block: move bdi/address_space unplug functions to backing-dev.h There's nothing block related about them, the backing device is used by things like NFS etc as well. This gets rid of the need to protect such calls by CONFIG_BLOCK. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
cfb1e33e |
|
02-Oct-2009 |
Jeff Moyer <jmoyer@redhat.com> |
aio: implement request batching Hi, Some workloads issue batches of small I/O, and the performance is poor due to the call to blk_run_address_space for every single iocb. Nathan Roberts pointed this out, and suggested that by deferring this call until all I/Os in the iocb array are submitted to the block layer, we can realize some impressive performance gains (up to 30% for sequential 4k reads in batches of 16). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
385773e0 |
|
22-Sep-2009 |
H Hartley Sweeten <hartleys@visionengravers.com> |
aio.c: move EXPORT* macros to line after function As mentioned in Documentation/CodingStyle, move EXPORT* macros to the line immediately after the closing function brace line. Also, move the __initcall() similarly. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
3d2d827f |
|
21-Sep-2009 |
Michael S. Tsirkin <mst@redhat.com> |
mm: move use_mm/unuse_mm from aio.c to mm/ Anyone who wants to do copy to/from user from a kernel thread, needs use_mm (like what fs/aio has). Move that into mm/, to make reusing and exporting easier down the line, and make aio use it. Next intended user, besides aio, will be vhost-net. Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
13389010 |
|
30-Jun-2009 |
Davide Libenzi <davidel@xmailserver.org> |
eventfd: revised interface and cleanups Change the eventfd interface to de-couple the eventfd memory context from the file pointer instance. Without such change, there is no clean, race-free way to handle the POLLHUP event sent when the last instance of the file* goes away. Also, now the internal eventfd APIs are using the eventfd context instead of the file*. This patch is required by KVM's IRQfd code, which is still under development. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Gregory Haskins <ghaskins@novell.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
65c24491 |
|
18-Mar-2009 |
Jeff Moyer <jmoyer@redhat.com> |
aio: lookup_ioctx can return the wrong value when looking up a bogus context The libaio test harness turned up a problem whereby lookup_ioctx on a bogus io context was returning the 1 valid io context from the list (harness/cases/3.p). Because of that, an extra put_iocontext was done, and when the process exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio (since we expect a users count of 1 and instead get 0). The problem was introduced by "aio: make the lookup_ioctx() lockless" (commit abf137dd7712132ee56d5b3143c2ff61a72a5faa). Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not return with a NULL tpos at the end of the loop, even if the entry was not found. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
87c3a86e |
|
18-Mar-2009 |
Davide Libenzi <davidel@xmailserver.org> |
eventfd: remove fput() call from possible IRQ context Remove a source of fput() call from inside IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was able to, with the attached test program. Independently from this, the bug is conceptually there, so we might be better off fixing it. This patch adds an optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty in general, but the alternative here would be to add a brand new delayed fput() infrastructure, that I'm not sure is worth it. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
002c8976 |
|
14-Jan-2009 |
Heiko Carstens <hca@linux.ibm.com> |
[CVE-2009-0029] System call wrappers part 16 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
#
abf137dd |
|
09-Dec-2008 |
Jens Axboe <jens.axboe@oracle.com> |
aio: make the lookup_ioctx() lockless The mm->ioctx_list is currently protected by a reader-writer lock, so we always grab that lock on the read side for doing ioctx lookups. As the workload is extremely reader biased, turn this into an rcu hlist so we can make lookup_ioctx() lockless. Get rid of the rwlock and use a spinlock for providing update side exclusion. There's usually only 1 entry on this list, so it doesn't make sense to look into fancier data structures. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
516e0cc5 |
|
25-Jul-2008 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] f_count may wrap around make it atomic_long_t; while we are at it, get rid of useless checks in affs, hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
246bb0b1 |
|
25-Jul-2008 |
Oleg Nesterov <oleg@tv-sign.ru> |
kill PF_BORROWED_MM in favour of PF_KTHREAD Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users. No functional changes yet. But this allows us to do further fixes/cleanups. oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the kthreads, this is wrong because of use_mm(). The problem with PF_BORROWED_MM is that we need task_lock() to avoid races. With this patch we can check PF_KTHREAD directly, or use a simple lockless helper:

    /* The result must not be dereferenced !!! */
    struct mm_struct *__get_task_mm(struct task_struct *tsk)
    {
            if (tsk->flags & PF_KTHREAD)
                    return NULL;
            return tsk->mm;
    }

Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel thread without PF_BORROWED_MM. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
aab2545f |
|
06-Jun-2008 |
Oleg Nesterov <oleg@tv-sign.ru> |
uml: activate_mm: remove the dead PF_BORROWED_MM check use_mm() was changed to use switch_mm() instead of activate_mm(), since then nobody calls (and nobody should call) activate_mm() with PF_BORROWED_MM bit set. As Jeff Dike pointed out, we can also remove the "old != new" check, it is always true. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c6f3a97f |
|
30-Apr-2008 |
Thomas Gleixner <tglx@linutronix.de> |
debugobjects: add timer specific object debugging code Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
39fa0031 |
|
29-Apr-2008 |
Jeff Moyer <jmoyer@redhat.com> |
aio: fix misleading comments The FIXME comments are inaccurate. The locking comment over lookup_ioctx() is wrong. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
801678c5 |
|
29-Apr-2008 |
Hirofumi Nakagawa <hnakagawa@miraclelinux.com> |
Remove duplicated unlikely() in IS_ERR() Some drivers have duplicated unlikely() macros. IS_ERR() already has unlikely() in itself. This patch cleans up such pointless code. Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jeff Garzik <jeff@garzik.org> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: David Brownell <david-b@pacbell.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.de> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d5470b59 |
|
29-Apr-2008 |
Adrian Bunk <bunk@kernel.org> |
fs/aio.c: make 3 functions static Make the following needlessly global functions static: - __put_ioctx() - lookup_ioctx() - io_submit_one() Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
e92adcba |
|
28-Apr-2008 |
Jeff Moyer <jmoyer@redhat.com> |
aio: io_getevents() should return if io_destroy() is invoked This patch wakes up a thread waiting in io_getevents if another thread destroys the context. This was tested using a small program that spawns a thread to wait in io_getevents while the parent thread destroys the io context and then waits for the getevents thread to exit. Without this patch, the program hangs indefinitely. With the patch, the program exits as expected. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Christopher Smith <x@xman.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8d1c98b0 |
|
10-Apr-2008 |
Davide Libenzi <davidel@xmailserver.org> |
eventfd/kaio integration fix Jeff Roberson discovered a race when using kaio eventfd based notifications. When it occurs it can lead to missed wakeups and hung userspace. This patch fixes the race by moving the notification inside the spinlocked section of kaio. The operation is safe since the eventfd spinlock and the kaio one are unrelated. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jeff Roberson <jroberson@chesapeake.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
598af051 |
|
10-Apr-2008 |
Roland McGrath <roland@redhat.com> |
asmlinkage_protect sys_io_getevents Use asmlinkage_protect in sys_io_getevents, because GCC for i386 with CONFIG_FRAME_POINTER=n can decide to clobber an argument word on the stack, i.e. the user struct pt_regs. Here the problem is not a tail call, but just the compiler's use of the stack when it inlines and optimizes the body of the called function. This seems to avoid it. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6cb2a210 |
|
19-Mar-2008 |
Quentin Barnes <qbarnes+linux@yahoo-inc.com> |
aio: bad AIO race in aio_complete() leads to process hang My group ran into an AIO process hang on a 2.6.24 kernel with the process sleeping indefinitely in io_getevents(2) waiting for the last wakeup to come and it never would. We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box ("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache isn't shared between all eight processors, but L2 is shared between the two processors on the Core2Duo we use. My analysis of the hang is, if you go down to the second while-loop in read_events(), what happens on processor #1:

    1) add_wait_queue_exclusive() adds thread to ctx->wait
    2) aio_read_evt() to check tail
    3) if aio_read_evt() returned 0, call [io_]schedule() and sleep

In aio_complete() with processor #2:

    A) info->tail = tail;
    B) waitqueue_active(&ctx->wait)
    C) if waitqueue_active() returned non-0, call wake_up()

The way the code is written, step 1 must be seen by all other processors before processor 1 checks for pending events in step 2 (that were recorded by step A) and step A by processor 2 must be seen by all other processors (checked in step 2) before step B is done. The race I believed I was seeing is that steps 1 and 2 were effectively swapped due to the __list_add() being delayed by the L2 cache not shared by some of the other processors. Imagine:

    proc 2: just before step A
    proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
    proc 1, step 2: checks tail and sees no pending events
    proc 2, step A: updates tail
    proc 1, step 3: calls [io_]schedule() and sleeps
    proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup

so proc 1 sleeps indefinitely. My patch adds a memory barrier between steps A and B. It ensures that the update in step 1 gets seen on processor 2 before continuing. If processor 1 was just before step 1, the memory barrier makes sure that step A (update tail) gets seen by the time processor 1 makes it to step 2 (check tail). Before the patch our AIO process would hang virtually 100% of the time. After the patch, we have yet to see the process ever hang. Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com> Reviewed-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ We should probably disallow that "if (waitqueue_active()) wake_up()" coding pattern, because it's so often buggy wrt memory ordering ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
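A sketch of the ordering the fix enforces on the completion side (names follow the steps above; illustrative fragment, not the verbatim patch):

    /* completion path, processor #2 */
    info->tail = tail;                      /* step A: publish the event        */
    smp_mb();                               /* order the tail store against the */
                                            /* waitqueue check, pairing with    */
                                            /* the waiter's add_wait_queue      */
    if (waitqueue_active(&ctx->wait))       /* step B */
            wake_up(&ctx->wait);            /* step C */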
|
#
c2ec6682 |
|
08-Feb-2008 |
Rusty Russell <rusty@rustcorp.com.au> |
aio: negative offset should return -EINVAL An AIO read or write should return -EINVAL if the offset is negative. This check matches the one in pread and pwrite. This was found by the libaio test suite. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7adfa2ff |
|
08-Feb-2008 |
Rusty Russell <rusty@rustcorp.com.au> |
aio: partial write should not return error code When an AIO write gets an error after writing some data (eg. ENOSPC), it should return the amount written already, not the error. Just like write() is supposed to. This was found by the libaio test suite. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-By: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
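The principle, as a sketch (the 'written' count is illustrative, standing in for however many bytes the retry loop has already completed):

    /* Like write(2): if some bytes were transferred before the error,
     * report the short count; only surface the error when nothing was
     * written at all. */
    if (ret < 0 && written > 0)
            ret = written;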
|
#
fc9b52cd |
|
08-Feb-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
fs: remove fastcall, it is always empty [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
56c4da45 |
|
30-Jan-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
core: remove last users of empty FASTCALL macro FASTCALL is always empty after the x86 removal. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
#
e00ba3da |
|
05-Dec-2007 |
Jeff Moyer <jmoyer@redhat.com> |
aio: only account I/O wait time in read_events if there are active requests On 2.6.24, top started showing 100% iowait on one CPU when a UML instance was running (but completely idle). The UML code sits in io_getevents waiting for an event to be submitted and completed. Fix this by checking ctx->reqs_active before scheduling to determine whether or not we are waiting for I/O. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Jeff Dike <jdike@addtoit.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
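A sketch of the described check in the read_events() wait loop (illustrative fragment):

    /* Only charge the sleep as iowait when this context really has
     * requests in flight; an idle io_getevents() waiter is just idle. */
    if (ctx->reqs_active)
            io_schedule();
    else
            schedule();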
|
#
6212e3a3 |
|
18-Oct-2007 |
Alexey Dobriyan <adobriyan@gmail.com> |
Remove struct task_struct::io_wait Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b during 2.6.9 development. The commit introduced the io_wait field, which remained write-only then and still remains write-only. Also garbage collect macros which "use" io_wait. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
41d10da3 |
|
17-Oct-2007 |
Jeff Moyer <jmoyer@redhat.com> |
aio: account I/O wait time properly Some months back I proposed changing the schedule() call in read_events to an io_schedule(): http://osdir.com/ml/linux.kernel.aio.general/2006-10/msg00024.html This was rejected as there are AIO operations that do not initiate disk I/O. I've had another look at the problem, and the only AIO operation that will not initiate disk I/O is IOCB_CMD_NOOP. However, this command isn't even wired up! Given that it doesn't work, and hasn't for *years*, I'm going to suggest again that we do proper I/O accounting when using AIO. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
87e2831c |
|
08-Oct-2007 |
Yan Zheng <yanzheng@21cn.com> |
AIO: fix cleanup in io_submit_one(...) When IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect, statement 'goto out_put_req' is executed. At label 'out_put_req', aio_put_req(..) is called, which requires 'req->ki_filp' set. Signed-off-by: Yan Zheng<yanzheng@21cn.com> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9c3060be |
|
10-May-2007 |
Davide Libenzi <davidel@xmailserver.org> |
signal/timer/event: KAIO eventfd support example This is an example about how to add eventfd support to the current KAIO code, in order to enable KAIO to post readiness events to a pollable fd (hence compatible with POSIX select/poll). The KAIO code simply signals the eventfd fd when events are ready, and this triggers a POLLIN in the fd. This patch uses a reserved-for-future-use member of the struct iocb to pass an eventfd file descriptor, that KAIO will use to post events every time a request completes. At that point, an aio_getevents() will return the completed result to a struct io_event. I made a quick test program to verify the patch, and it runs fine here: http://www.xmailserver.org/eventfd-aio-test.c The test program uses poll(2), but it'd, of course, work with select and epoll too. This allows scheduling both block I/O and other poll-able device requests, and waiting for results using select/poll/epoll. In a typical scenario, an application would submit KAIO requests using aio_submit(), and would also use epoll_ctl() on the whole other class of devices (which, with the addition of signals, timers and user events, is now pretty much complete), and then would:

    epoll_wait(...);
    for_each_event {
            if (curr_event_is_kaiofd) {
                    aio_getevents();
                    dispatch_aio_events();
            } else {
                    dispatch_epoll_event();
            }
    }

Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
28e53bdd |
|
09-May-2007 |
Oleg Nesterov <oleg@tv-sign.ru> |
unify flush_work/flush_work_keventd and rename it to cancel_work_sync flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq (this was possible from the very beginning, I missed this). So we can unify flush_work_keventd and flush_work. Also, rename flush_work() to cancel_work_sync() and fix all callers. Perhaps this is not the best name, but "flush_work" is really bad. (akpm: this is why the earlier patches bypassed maintainers) Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jeff Garzik <jeff@garzik.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Auke Kok <auke-jan.h.kok@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a9df62c7 |
|
09-May-2007 |
Andrew Morton <akpm@osdl.org> |
aio: use flush_work() Migrate AIO over to use flush_work(). Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0a31bd5f |
|
06-May-2007 |
Christoph Lameter <clameter@sgi.com> |
KMEM_CACHE(): simplify slab cache creation This patch provides a new macro KMEM_CACHE(<struct>, <flags>) to simplify slab creation. KMEM_CACHE creates a slab with the name of the struct, with the size of the struct and with the alignment of the struct. Additional slab flags may be specified if necessary. Example:

    struct test_slab {
            int a,b,c;
            struct list_head;
    } __cacheline_aligned_in_smp;

    test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

will create a new slab named "test_slab" of the size sizeof(struct test_slab) and aligned to the alignment of test_slab. If it fails then we panic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
28defbea |
|
27-Mar-2007 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: remove bare user-triggerable error printk The user can generate console output if they cause do_mmap() to fail during sys_io_setup(). This was seen in a regression test that does exactly that by spinning calling mmap() until it gets -ENOMEM before calling io_setup(). We don't need this printk at all, just remove it. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c3762229 |
|
10-Feb-2007 |
Robert P. J. Day <rpjday@mindspring.com> |
[PATCH] Transform kmem_cache_alloc()+memset(0) -> kmem_cache_zalloc(). Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the corresponding "kmem_cache_zalloc()" call. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@muc.de> Cc: Roland McGrath <roland@redhat.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Greg KH <greg@kroah.com> Acked-by: Joel Becker <Joel.Becker@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jan Kara <jack@ucw.cz> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
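The transformation is mechanical; for illustration (the cache name and variable are placeholders):

    /* before: allocate, then clear by hand */
    req = kmem_cache_alloc(cache, GFP_KERNEL);
    if (req)
            memset(req, 0, sizeof(*req));

    /* after: a single call that returns zeroed memory */
    req = kmem_cache_zalloc(cache, GFP_KERNEL);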
|
#
e10a4437 |
|
10-Feb-2007 |
Robert P. J. Day <rpjday@mindspring.com> |
[PATCH] Remove final references to deprecated "MAP_ANON" page protection flag Remove the last vestiges of the long-deprecated "MAP_ANON" page protection flag: use "MAP_ANONYMOUS" instead. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dee11c23 |
|
03-Feb-2007 |
Ken Chen <kenchen@google.com> |
[PATCH] aio: fix buggy put_ioctx call in aio_complete - v2 An AIO bug was reported that a sleeping function is being called in softirq context:

    BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
    Call Trace:
     [<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
     [<a000000100577ba0>] mutex_lock+0x20/0x40
     [<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
     [<a00000010018c0c0>] __put_ioctx+0xc0/0x240
     [<a00000010018d470>] aio_complete+0x2f0/0x420
     [<a00000010019cc80>] finished_one_bio+0x200/0x2a0
     [<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
     [<a00000010019d260>] dio_bio_end_aio+0x60/0x80
     [<a00000010014acd0>] bio_endio+0x110/0x1c0
     [<a0000001002770e0>] __end_that_request_first+0x180/0xba0
     [<a000000100277b90>] end_that_request_chunk+0x30/0x60
     [<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
     [<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
     [<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
     [<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
     [<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
     [<a000000100277d20>] blk_done_softirq+0x160/0x1c0
     [<a000000100083e00>] __do_softirq+0x200/0x240
     [<a000000100083eb0>] do_softirq+0x70/0xc0

See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2 flush_workqueue() is not allowed to be called in the softirq context. However, aio_complete() called from I/O interrupt can potentially call put_ioctx with the last ref count on the ioctx and trigger the bug. It is simply incorrect to perform ioctx freeing from aio_complete. The bug is trigger-able from a race between io_destroy() and aio_complete(). A possible scenario:

    cpu0                                cpu1
    io_destroy                          aio_complete
      wait_for_all_aios {                 __aio_put_req
        ...                                 ctx->reqs_active--;
        if (!ctx->reqs_active)
          return;
      }
      ...
      put_ioctx(ioctx)                    put_ioctx(ctx);
                                            __put_ioctx
                                              bam! Bug trigger!

The real problem is that the condition check of ctx->reqs_active in wait_for_all_aios() is incorrect in that access to reqs_active is not being properly protected by a spin lock. This patch adds that protective spin lock, and at the same time removes all duplicate ref counting for each kiocb as reqs_active is already used as a ref count for each active ioctx. This also ensures that the buggy call to flush_workqueue() in softirq context is eliminated. Signed-off-by: "Ken Chen" <kenchen@google.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: <stable@kernel.org> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1ebb1101 |
|
29-Dec-2006 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] Fix lock inversion aio_kick_handler() lockdep found an AB BC CA lock inversion in retry-based AIO: 1) The task struct's alloc_lock (A) is acquired in process context with interrupts enabled. An interrupt might arrive and call wake_up() which grabs the wait queue's q->lock (B). 2) When performing retry-based AIO the AIO core registers aio_wake_function() as the wake function for iocb->ki_wait. It is called with the wait queue's q->lock (B) held and then tries to add the iocb to the run list after acquiring the ctx_lock (C). 3) aio_kick_handler() holds the ctx_lock (C) while acquiring the alloc_lock (A) via lock_task() and unuse_mm(). Lockdep emits a warning saying that we're trying to connect the irq-safe q->lock to the irq-unsafe alloc_lock via ctx_lock. This fixes the inversion by calling unuse_mm() in the AIO kick handling path after we've released the ctx_lock. As Ben LaHaise pointed out __put_ioctx could set ctx->mm to NULL, so we must only access ctx->mm while we have the lock. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
90aef12e |
|
13-Dec-2006 |
Jeremy Fitzhardinge <jeremy@goop.org> |
[PATCH] Use activate_mm() in fs/aio.c:use_mm() activate_mm() is not the right thing to be using in use_mm(). It should be switch_mm(). On normal x86, they're synonymous, but for the Xen patches I'm adding a hook which assumes that activate_mm is only used the first time a new mm is used after creation (I have another hook for dealing with dup_mm). I think this use of activate_mm() is the only place where it could be used a second time on an mm. From a quick look at the other architectures I think this is OK (most simply implement one in terms of the other), but some are doing some subtly different stuff between the two. Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
97d2a805 |
|
06-Dec-2006 |
Benjamin LaHaise <bcrl@kvack.org> |
[PATCH] aio: remove ki_retried debugging member Remove the ki_retried member from struct kiocb. I think the idea was bounced around a while back, but Arnaldo pointed out another reason that we should dig it up when he pointed out that the last cacheline of struct kiocb only contains 4 bytes. By removing the debugging member, we save more than 8 bytes on 64-bit machines. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
b62e8ec2 |
|
06-Dec-2006 |
Kenneth W Chen <kenneth.w.chen@intel.com> |
[PATCH] aio: kill pointless ki_nbytes assignment in aio_setup_single_vector io_submit_one assigns ki_left = ki_nbytes = iocb->aio_nbytes, then calls down to aio_setup_iocb, then to aio_setup_single_vector. In there, ki_nbytes is reassigned to the same value it got two call stack above it. There is no need to do so. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
e18b890b |
|
06-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
            quilt add $file
            sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
            mv /tmp/$$ $file
            quilt refresh
    done

The script was run like this:

    sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
93e06b41 |
|
29-Nov-2006 |
Eric Sesterhenn <snakebyte@gmx.de> |
BUG_ON conversion for fs/aio.c This patch converts an if () BUG(); construct to BUG_ON(); which occupies less space, uses unlikely and is safer when BUG() is disabled. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
65f27f38 |
|
22-Nov-2006 |
David Howells <dhowells@redhat.com> |
WorkStruct: Pass the work_struct pointer instead of context data Pass the work_struct pointer to the work function rather than context data. The work function can use container_of() to work out the data. For the cases where the container of the work_struct may go away the moment the pending bit is cleared, it is made possible to defer the release of the structure by deferring the clearing of the pending bit. To make this work, an extra flag is introduced into the management side of the work_struct. This governs auto-release of the structure upon execution. Ordinarily, the work queue executor would release the work_struct for further scheduling or deallocation by clearing the pending bit prior to jumping to the work function. This means that, unless the driver makes some guarantee itself that the work_struct won't go away, the work function may not access anything else in the work_struct or its container lest they be deallocated.. This is a problem if the auxiliary data is taken away (as done by the last patch). However, if the pending bit is *not* cleared before jumping to the work function, then the work function *may* access the work_struct and its container with no problems. But then the work function must itself release the work_struct by calling work_release(). In most cases, automatic release is fine, so this is the default. Special initiators exist for the non-auto-release case (ending in _NAR). Signed-Off-By: David Howells <dhowells@redhat.com>
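A minimal example of the new convention (hypothetical structure and function names; kernel-style fragment):

    struct my_request {
            struct work_struct work;
            int payload;
    };

    /* the work function now receives the work_struct itself ... */
    static void my_request_fn(struct work_struct *work)
    {
            /* ... and recovers its container instead of a void *data */
            struct my_request *req = container_of(work, struct my_request, work);

            handle(req->payload);   /* handle() is a placeholder */
    }

    /* setup: INIT_WORK() no longer takes a context-data argument */
    INIT_WORK(&req->work, my_request_fn);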
|
#
52bad64d |
|
22-Nov-2006 |
David Howells <dhowells@redhat.com> |
WorkStruct: Separate delayable and non-delayable events. Separate delayable work items from non-delayable work items be splitting them into a separate structure (delayed_work), which incorporates a work_struct and the timer_list removed from work_struct. The work_struct struct is huge, and this limits it's usefulness. On a 64-bit architecture it's nearly 100 bytes in size. This reduces that by half for the non-delayable type of event. Signed-Off-By: David Howells <dhowells@redhat.com>
|
#
691578cd |
|
03-Oct-2006 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] pr_debug: aio: use size_t length modifier in pr_debug format arguments aio: use size_t length modifier in pr_debug format arguments Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
eed4e51f |
|
01-Oct-2006 |
Badari Pulavarty <pbadari@us.ibm.com> |
[PATCH] Add vector AIO support This work was initially done by Zach Brown to add support for vectored aio. These are the core changes for AIO to support IOCB_CMD_PREADV/IOCB_CMD_PWRITEV. [akpm@osdl.org: huge build fix] Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
027445c3 |
|
01-Oct-2006 |
Badari Pulavarty <pbadari@us.ibm.com> |
[PATCH] Vectorize aio_read/aio_write fileop methods This patch vectorizes aio_read() and aio_write() methods to prepare for collapsing all aio & vectored operations into one interface - which is aio_read()/aio_write(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Michael Holzheu <HOLZHEU@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d6e05edc |
|
26-Jun-2006 |
Andreas Mohr <andi@lisas.de> |
spelling fixes acquired (aquired) contiguous (contigious) successful (succesful, succesfull) surprise (suprise) whether (weather) some other misspellings Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
#
626ab0e6 |
|
23-Jun-2006 |
Oleg Nesterov <oleg@tv-sign.ru> |
[PATCH] list: use list_replace_init() instead of list_splice_init() list_splice_init(list, head) does unneeded job if it is known that list_empty(head) == 1. We can use list_replace_init() instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
11b0b5ab |
|
25-Mar-2006 |
Oliver Neukum <neukum@fachschaft.cup.uni-muenchen.de> |
[PATCH] use kzalloc and kcalloc in core fs code Signed-off-by: Oliver Neukum <oliver@neukum.name> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
095975da |
|
08-Jan-2006 |
Nick Piggin <nickpiggin@yahoo.com.au> |
[PATCH] rcu file: use atomic primitives Use atomic_inc_not_zero for rcu files instead of special case rcuref. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d00689af |
|
13-Nov-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: replace locking comments with assert_spin_locked() aio: replace locking comments with assert_spin_locked() Signed-off-by: Zach Brown <zach.brown@oracle.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
20dcae32 |
|
13-Nov-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: remove kioctx from mm_struct Sync iocbs have a life cycle that doesn't need a kioctx. Their retrying, if any, is done in the context of their owner who has allocated them on the stack. The sole user of a sync iocb's ctx reference was aio_complete() checking for an elevated iocb ref count that could never happen. No path which grabs an iocb ref has access to sync iocbs. If we were to implement sync iocb cancelation it would be done by the owner of the iocb using its on-stack reference. Removing this chunk from aio_complete allows us to remove the entire kioctx instance from mm_struct, reducing its size by a third. On an i386 testing box the slab size went from 768 to 504 bytes and from 5 to 8 per page. Signed-off-by: Zach Brown <zach.brown@oracle.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
d55b5fda |
|
07-Nov-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: remove aio_max_nr accounting race AIO was adding a new context's max requests to the global total before testing if that resulting total was over the global limit. This let innocent tasks get their new limit tested along with a racing guilty task that was crossing the limit. This serializes the _nr accounting with a spinlock. It also switches to using unsigned long for the global totals. Individual contexts are still limited to an unsigned int's worth of requests by the syscall interface. The problem and fix were verified with a simple program that spun creating and destroying a context while holding on to another long lived context. Before the patch a task creating a tiny context could get a spurious EAGAIN if it raced with a task creating a very large context that overran the limit. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
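A sketch of the serialized check-then-account this describes (illustrative; it follows the wording above rather than the exact patch):

    /* Test against the limit *before* adding, under a spinlock, so a
     * racing oversized context can no longer push an innocent task over
     * aio_max_nr. */
    spin_lock(&aio_nr_lock);
    if (aio_nr + nr_events > aio_max_nr ||
        aio_nr + nr_events < aio_nr)        /* unsigned overflow */
            ctx->max_reqs = 0;              /* reject this context */
    else
            aio_nr += nr_events;
    spin_unlock(&aio_nr_lock);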
|
#
8766ce41 |
|
23-Oct-2005 |
Kostik Belousov <konstantin.belousov@zoral.com.ua> |
[PATCH] aio syscalls are not checked by lsm Another case of missing call to security_file_permission: aio functions (namely, io_submit) do not check credentials with security modules. Below is the simple patch to the problem. It seems that it is enough to check for rights at the request submission time. Signed-off-by: Kostik Belousov <kostikbel@gmail.com> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
4faa5285 |
|
17-Oct-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: revert lock_kiocb() lock_kiocb() was introduced to serialize retrying and cancellation. In the process of doing so it tried to sleep waiting for KIF_LOCKED while holding the ctx_lock spinlock. Recent fixes have ensured that multiple concurrent retries won't be attempted for a given iocb. Cancel has other problems and has no significant in-tree users that have been complaining about it. So for the immediate future we'll revert sleeping with the lock held and will address proper cancellation and retry serialization in the future. Signed-off-by: Zach Brown <zach.brown@oracle.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
353fb07e |
|
30-Sep-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: avoid extra aio_{read,write} call when ki_left == 0 Recently aio_p{read,write} changed to perform retries internally rather than returning -EIOCBRETRY. This inadvertently resulted in always calling aio_{read,write} with ki_left at 0 which would in turn immediately return 0. Harmless, but we can avoid this call by checking in the caller. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
897f15fb |
|
30-Sep-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: remove unlocked task_list test and resulting race Only one of the run or kick path is supposed to put an iocb on the run list. If both of them do it then one of them can end up referencing a freed iocb. The kick path could delete the task_list item from the wait queue before getting the ctx_lock and putting the iocb on the run list. The run path was testing the task_list item outside the lock so that it could catch ki_retry methods that return -EIOCBRETRY *without* putting the iocb on a wait queue and promising to call kick_iocb. This unlocked check could then race with the kick path to cause both to try and put the iocb on the run list. The patch stops the run path from testing task_list by requiring that any ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY users, are updated. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
998765e5 |
|
30-Sep-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] aio: lock around kiocbTryKick() Only one of the run or kick path is supposed to put an iocb on the run list. If both of them do it then one of them can end up referencing a freed iocb. The kick path could set the Kicked bit before acquiring the ctx_lock and putting the iocb on the run list. The run path, while holding the ctx_lock, could see this partial kick and mistake it for a kick that was deferred while it was doing work with the run_list NULLed out. It would then race with the kick thread to add the iocb to the run list. This patch moves the kick setting under the ctx_lock so that only one of the kick or run path queues the iocb on the run list, as intended. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
a464adeb |
|
16-Sep-2005 |
Zach Brown <zach.brown@oracle.com> |
[PATCH] Add smp_mb__after_clear_bit() to unlock_kiocb() Add smp_mb__after_clear_bit() to unlock_kiocb() AIO's use of wait_on_bit_lock()/wake_up_bit() forgot to add a barrier between clearing its lock bit and calling wake_up_bit() so wake_up_bit()'s unlocked waitqueue_active() can race. This puts AIO's use in line with the others and the comment above wake_up_bit(). Signed-off-by: Zach Brown <zach.brown@oracle.com> Acked-by: Benjamin LaHaise <bcrl@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
ab2af1f5 |
|
09-Sep-2005 |
Dipankar Sarma <dipankar@in.ibm.com> |
[PATCH] files: files struct with RCU Patch to eliminate struct files_struct.file_lock spinlock on the reader side and use rcu refcounting rcuref_xxx api for the f_count refcounter. The updates to the fdtable are done by allocating a new fdtable structure and setting files->fdt to point to the new structure. The fdtable structure is protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A global list of fdtable to be freed is not scalable, so we use a per-cpu list. If keventd is already handling the current cpu's work, we use a timer to defer queueing of that work. Since the last publication, this patch has been re-written to avoid using explicit memory barriers and use rcu_assign_pointer(), rcu_dereference() primitives instead. This required that the fd information is kept in a separate structure (fdtable) and updated atomically. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
ac0b1bc1 |
|
09-Sep-2005 |
Benjamin LaHaise <bcrl@linux.intel.com> |
[PATCH] aio: kiocb locking to serialise retry and cancel Implement a per-kiocb lock to serialise retry operations and cancel. This is done using wait_on_bit_lock() on the KIF_LOCKED bit of kiocb->ki_flags. Also, make the cancellation path lock the kiocb and subsequently release all references to it if the cancel was successful. This version includes a fix for the deadlock with __aio_run_iocbs. Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
8f58202b |
|
09-Sep-2005 |
Wendy Cheng <wcheng@redhat.com> |
[PATCH] change io_cancel return code for no cancel case Note that other than few exceptions, most of the current filesystem and/or drivers do not have aio cancel specifically defined (kiob->ki_cancel field is mostly NULL). However, sys_io_cancel system call universally sets return code to -EAGAIN. This gives applications a wrong impression that this call is implemented but just never works. We have customer inquires about this issue. Changed by Benjamin LaHaise to EINVAL instead of ENOSYS Signed-off-by: S. Wendy Cheng <wcheng@redhat.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1e40cd38 |
|
03-Sep-2005 |
Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> |
[PATCH] uml: fixes performance regression in activate_mm and thus exec() Normally, activate_mm() is called from exec(), and thus it used to be a no-op because we use a completely new "MM context" on the host (for instance, a new process), and so we didn't need to flush any "TLB entries" (which for us are the set of memory mappings for the host process from the virtual "RAM" file). Kernel threads, instead, are usually handled in a different way. So, when for AIO we call use_mm(), things used to break and so Benjamin implemented activate_mm(). However, that is only needed for AIO, and could slow down exec() inside UML, so be smart: detect being called for AIO (via PF_BORROWED_MM) and do the full flush only in that situation. Comment also the caller so that people won't go breaking UML without noticing. I also rely on the caller's locks for testing current->flags. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> CC: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
c016e225 |
|
28-Jun-2005 |
Sébastien Dugué <sebastien.dugue@bull.net> |
[PATCH] aio-retry-fix: fix aio retry work queueing In the case of buffered AIO, in the aio retry path (aio_run_iocb), when the retry method returns EIOCBRETRY the kicked iocb is added to the context's run list but is never queued onto the work queue; the request is therefore never completed. This patch fixes that by adding the appropriate call to aio_queue_work() in aio_run_iocb() so that subsequent retries will be handled by the aio worker thread. Signed-off-by: Sébastien Dugué <sebastien.dugue@bull.net> Acked-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
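A sketch of the essential shape of the fixed retry path; the list and field names reflect fs/aio.c of that era but are illustrative rather than copied from the patch:

    /* inside the retry path, after the retry method has returned */
    if (ret == -EIOCBRETRY && list_empty(&iocb->ki_run_list)) {
            list_add_tail(&iocb->ki_run_list, &ctx->run_list);
            aio_queue_work(ctx);    /* previously missing: make a worker pick it up */
    }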
|
#
954d3e95 |
|
01-May-2005 |
Kenneth W Chen <kenneth.w.chen@intel.com> |
[PATCH] aio: optimize io_submit_one() This patch optimizes io_submit_one() to call aio_run_iocb() directly if ctx->run_list is empty. When the list is empty, adding the iocb to the list and then calling __aio_run_iocbs() is unnecessary, because both can be done in one step; ctx->run_list always has only one element in this case. This optimization speeds up an industry-standard database transaction processing benchmark by 0.2%. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
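Roughly, the fast path looks like the following sketch; this is illustrative only, since the real io_submit_one() carries more surrounding state:

    spin_lock_irq(&ctx->ctx_lock);
    if (list_empty(&ctx->run_list)) {
            aio_run_iocb(req);                      /* fast path: run the new iocb directly */
    } else {
            list_add_tail(&req->ki_run_list, &ctx->run_list);
            __aio_run_iocbs(ctx);                   /* old path: queue, then drain the list */
    }
    spin_unlock_irq(&ctx->ctx_lock);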
|
#
644d3a08 |
|
01-May-2005 |
Kenneth W Chen <kenneth.w.chen@intel.com> |
[PATCH] aio: clean up debug code Clean up code that was previously used for debugging purposes. Remove aio_run, aio_wakeups, iocb->ki_queued and iocb->ki_kicked. Also clean up the unused variable count in __aio_run_iocbs() and the debug code in read_events(). Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
4bf69b2a |
|
01-May-2005 |
Kenneth W Chen <kenneth.w.chen@intel.com> |
[PATCH] aio: ring wrapping simplification Since the tail pointer in the aio_ring structure never wraps around the ring size more than once, a simple compare is sufficient to wrap the index around. This avoids a more expensive mod operation. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
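In code the simplification amounts to the following (variable names simplified for illustration):

    /* advance the ring tail by one event; since the tail never runs past
     * the ring size by more than one wrap, a compare replaces the modulo */
    tail++;
    if (tail >= ring->nr)   /* instead of: tail = tail % ring->nr; */
            tail = 0;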
|
#
212079cf |
|
01-May-2005 |
Kenneth W Chen <kenneth.w.chen@intel.com> |
[PATCH] aio: remove superfluous kiocb member initialization This patch removes superfluous kiocb member initialization in the AIO allocation and deallocation path. For example, in really_put_req(), right before the kiocb is returned to the slab, 5 variables are reset to NULL. The same variables are initialized at kiocb allocation time, so why bother resetting them, knowing that they will be set to valid data at allocation time? Another example: ki_retry is initialized in __aio_get_req(), but is initialized again in io_submit_one(). Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
25ee7e38 |
|
25-Apr-2005 |
Adrian Bunk <bunk@stusta.de> |
[PATCH] fs/aio.c: make some code static This patch makes some needlessly global code static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|