#
705bcfcb |
|
12-Dec-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: use splice_copy_file_range() inline helper generic_copy_file_range() is just a wrapper around splice_file_range(), which caps the maximum copy length. The only caller of splice_file_range(), namely __ceph_copy_file_range() is already ready to cope with short copy. Move the length capping into splice_file_range() and replace the exported symbol generic_copy_file_range() with a simple inline helper. Suggested-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-fsdevel/20231204083849.GC32438@lst.de/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
0f292086 |
|
12-Dec-2023 |
Amir Goldstein <amir73il@gmail.com> |
splice: return type ssize_t from all helpers Not sure why some splice helpers return long, maybe historic reasons. Change them all to return ssize_t to conform to the splice methods and to the rest of the helpers. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
da40448c |
|
30-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: move file_start_write() into direct_splice_actor() The callers of do_splice_direct() hold file_start_write() on the output file. This may cause file permission hooks to be called indirectly on an overlayfs lower layer, which is on the same filesystem of the output file and could lead to deadlock with fanotify permission events. To fix this potential deadlock, move file_start_write() from the callers into the direct_splice_actor(), so file_start_write() will not be held while splicing from the input file. Suggested-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231128214258.GA2398475@perftesting/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231130141624.3338942-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
488e8f68 |
|
30-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: fork splice_file_range() from do_splice_direct() In preparation of calling do_splice_direct() without file_start_write() held, create a new helper splice_file_range(), to be called from context of ->copy_file_range() methods instead of do_splice_direct(). Currently, the only difference is that splice_file_range() does not take flags argument and that it asserts that file_start_write() is held, but we factor out a common helper do_splice_direct_actor() that will be used later. Use the new helper from __ceph_copy_file_range(), that was incorrectly passing to do_splice_direct() the copy flags argument as splice flags. The value of copy flags in ceph is always 0, so it is a smenatic bug fix. Move the declaration of both helpers to linux/splice.h. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231130141624.3338942-2-amir73il@gmail.com Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
d53471ba6f |
|
23-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
splice: remove permission hook from iter_file_splice_write() All the callers of ->splice_write(), (e.g. do_splice_direct() and do_splice()) already check rw_verify_area() for the entire range and perform all the other checks that are in vfs_write_iter(). Instead of creating another tiny helper for special caller, just open-code it. This is needed for fanotify "pre content" events. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-6-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
b70d8e2b |
|
22-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
splice: move permission hook out of splice_file_to_pipe() vfs_splice_read() has a permission hook inside rw_verify_area() and it is called from splice_file_to_pipe(), which is called from do_splice() and do_sendfile(). do_sendfile() already has a rw_verify_area() check for the entire range. do_splice() has a rw_verify_check() for the splice to file case, not for the splice from file case. Add the rw_verify_area() check for splice from file case in do_splice() and use a variant of vfs_splice_read() without rw_verify_area() check in splice_file_to_pipe() to avoid the redundant rw_verify_area() checks. This is needed for fanotify "pre content" events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-5-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
feebea75 |
|
22-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
splice: move permission hook out of splice_direct_to_actor() vfs_splice_read() has a permission hook inside rw_verify_area() and it is called from do_splice_direct() -> splice_direct_to_actor(). The callers of do_splice_direct() (e.g. vfs_copy_file_range()) already call rw_verify_area() for the entire range, but the other caller of splice_direct_to_actor() (nfsd) does not. Add the rw_verify_area() checks in nfsd_splice_read() and use a variant of vfs_splice_read() without rw_verify_area() check in splice_direct_to_actor() to avoid the redundant rw_verify_area() checks. This is needed for fanotify "pre content" events. Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-4-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
2a33e2dd |
|
22-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
splice: remove permission hook from do_splice_direct() All callers of do_splice_direct() have a call to rw_verify_area() for the entire range that is being copied, e.g. by vfs_copy_file_range() or do_sendfile() before calling do_splice_direct(). The rw_verify_area() check inside do_splice_direct() is redundant and is called after sb_start_write(), so it is not "start-write-safe". Remove this redundant check. This is needed for fanotify "pre content" events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-3-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
0201ebf2 |
|
28-Jun-2023 |
David Howells <dhowells@redhat.com> |
mm: merge folio_has_private()/filemap_release_folio() call pairs Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
781ca602 |
|
21-Aug-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
splice: Convert page_cache_pipe_buf_confirm() to use a folio Convert buf->page to a folio once instead of five times. There's only one uptodate bit per folio, not per page, so we lose nothing here. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Message-Id: <20230821141541.2535953-1-willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
576d498e |
|
03-Jul-2023 |
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> |
splice: fsnotify_access(in), fsnotify_modify(out) on success in tee Same logic applies here: this can fill up the pipe, and pollers that rely on getting IN_MODIFY notifications never wake up. Fixes: 983652c69199 ("splice: report related fsnotify events") Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u Link: https://bugs.debian.org/1039488 Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <10d76dd8c85017ae3cd047c9b9a32e26daefdaa2.1688393619.git.nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
7f0f1ea0 |
|
03-Jul-2023 |
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> |
splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice Same logic applies here: this can fill up the pipe and pollers that rely on getting IN_MODIFY notifications never wake up. Fixes: 983652c69199 ("splice: report related fsnotify events") Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u Link: https://bugs.debian.org/1039488 Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <8d9ad5acb9c5c1dd2376a2ff5da6ac3183115389.1688393619.git.nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
12ee4b66 |
|
03-Jul-2023 |
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> |
splice: always fsnotify_access(in), fsnotify_modify(out) on success The current behaviour caused an asymmetry where some write APIs (write, sendfile) would notify the written-to/read-from objects, but splice wouldn't. This affected userspace which uses inotify, most notably coreutils tail -f, to monitor pipes. If the pipe buffer had been filled by a splice-family function: * tail wouldn't know and thus wouldn't service the pipe, and * all writes to the pipe would block because it's full, thus service was denied. (For the particular case of tail -f this could be worked around with ---disable-inotify.) Fixes: 983652c69199 ("splice: report related fsnotify events") Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u Link: https://bugs.debian.org/1039488 Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <604ec704d933e0e0121d9e107ce914512e045fad.1688393619.git.nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
0f0fa27b |
|
24-Jul-2023 |
Jan Stancek <jstancek@redhat.com> |
splice, net: Fix splice_to_socket() for O_NONBLOCK socket LTP sendfile07 [1], which expects sendfile() to return EAGAIN when transferring data from regular file to a "full" O_NONBLOCK socket, started failing after commit 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()"). sendfile() no longer immediately returns, but now blocks. Removed sock_sendpage() handled this case by setting a MSG_DONTWAIT flag, fix new splice_to_socket() to do the same for O_NONBLOCK sockets. [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/sendfile/sendfile07.c Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()") Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Tested-by: Xi Ruoyao <xry111@xry111.site> Link: https://lore.kernel.org/r/023c0e21e595e00b93903a813bc0bfb9a5d7e368.1690219914.git.jstancek@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2e82f6c3 |
|
14-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
splice: simplify a conditional in copy_splice_read Check for -EFAULT instead of wrapping the check in an ret < 0 block. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0b24be46 |
|
14-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
splice: don't call file_accessed in copy_splice_read copy_splice_read calls into ->read_iter to read the data, which already calls file_accessed. Fixes: 33b3b041543e ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
ca2d49f7 |
|
14-Jun-2023 |
David Howells <dhowells@redhat.com> |
splice, net: Fix splice_to_socket() to handle pipe bufs larger than a page splice_to_socket() assumes that a pipe_buffer won't hold more than a single page of data - but this assumption can be violated by skb_splice_bits() when it splices from a socket into a pipe. The problem is that splice_to_socket() doesn't advance the pipe_buffer length and offset when transcribing from the pipe buf into a bio_vec, so if the buf is >PAGE_SIZE, it keeps repeating the same initial chunk and doesn't advance the tail index. It then subtracts this from "remain" and overcounts the amount of data to be sent. The cleanup phase then tries to overclean the pipe, hits an unused pipe buf and a NULL-pointer dereference occurs. Fix this by not restricting the bio_vec size to PAGE_SIZE and instead transcribing the entirety of each pipe_buffer into a single bio_vec and advancing the tail index if remain hasn't hit zero yet. Large bio_vecs will then be split up by iterator functions such as iov_iter_extract_pages(). This resulted in a KASAN report looking like: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] ... RIP: 0010:pipe_buf_release include/linux/pipe_fs_i.h:203 [inline] RIP: 0010:splice_to_socket+0xa91/0xe30 fs/splice.c:933 Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()") Reported-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/0000000000000900e905fdeb8e39@google.com/ Tested-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: Christian Brauner <brauner@kernel.org> cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/1428985.1686737388@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
219d9205 |
|
07-Jun-2023 |
David Howells <dhowells@redhat.com> |
splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor() splice_direct_to_actor() doesn't manage SPLICE_F_MORE correctly[1] - and, as a result, it incorrectly signals/fails to signal MSG_MORE when splicing to a socket. The problem I'm seeing happens when a short splice occurs because we got a short read due to hitting the EOF on a file: as the length read (read_len) is less than the remaining size to be spliced (len), SPLICE_F_MORE (and thus MSG_MORE) is set. The issue is that, for the moment, we have no way to know *why* the short read occurred and so can't make a good decision on whether we *should* keep MSG_MORE set. MSG_SENDPAGE_NOTLAST was added to work around this, but that is also set incorrectly under some circumstances - for example if a short read fills a single pipe_buffer, but the next read would return more (seqfile can do this). This was observed with the multi_chunk_sendfile tests in the tls kselftest program. Some of those tests would hang and time out when the last chunk of file was less than the sendfile request size: build/kselftest/net/tls -r tls.12_aes_gcm.multi_chunk_sendfile This has been observed before[2] and worked around in AF_TLS[3]. Fix this by making splice_direct_to_actor() always signal SPLICE_F_MORE if we haven't yet hit the requested operation size. SPLICE_F_MORE remains signalled if the user passed it in to splice() but otherwise gets cleared when we've read sufficient data to fulfill the request. If, however, we get a premature EOF from ->splice_read(), have sent at least one byte and SPLICE_F_MORE was not set by the caller, ->splice_eof() will be invoked. Signed-off-by: David Howells <dhowells@redhat.com> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: David Hildenbrand <david@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ [2] Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a [3] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2bfc6685 |
|
07-Jun-2023 |
David Howells <dhowells@redhat.com> |
splice, net: Add a splice_eof op to file-ops and socket-ops Add an optional method, ->splice_eof(), to allow splice to indicate the premature termination of a splice to struct file_operations and struct proto_ops. This is called if sendfile() or splice() encounters all of the following conditions inside splice_direct_to_actor(): (1) the user did not set SPLICE_F_MORE (splice only), and (2) an EOF condition occurred (->splice_read() returned 0), and (3) we haven't read enough to fulfill the request (ie. len > 0 still), and (4) we have already spliced at least one byte. A further patch will modify the behaviour of SPLICE_F_MORE to always be passed to the actor if either the user set it or we haven't yet read sufficient data to fulfill the request. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: David Hildenbrand <david@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: linux-mm@kvack.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2dc334f1 |
|
07-Jun-2023 |
David Howells <dhowells@redhat.com> |
splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() Replace generic_splice_sendpage() + splice_from_pipe + pipe_to_sendpage() with a net-specific handler, splice_to_socket(), that calls sendmsg() with MSG_SPLICE_PAGES set instead of calling ->sendpage(). MSG_MORE is used to indicate if the sendmsg() is expected to be followed with more data. This allows multiple pipe-buffer pages to be passed in a single call in a BVEC iterator, allowing the processing to be pushed down to a loop in the protocol driver. This helps pave the way for passing multipage folios down too. Protocols that haven't been converted to handle MSG_SPLICE_PAGES yet should just ignore it and do a normal sendmsg() for now - although that may be a bit slower as it may copy everything. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9eee8bd8 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: kdoc for filemap_splice_read() and copy_splice_read() Provide kerneldoc comments for filemap_splice_read() and copy_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Christoph Hellwig <hch@lst.de> cc: Jens Axboe <axboe@kernel.dk> cc: Steve French <smfrench@gmail.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-32-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
c6585011 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Remove generic_file_splice_read() Remove generic_file_splice_read() as it has been replaced with calls to filemap_splice_read() and copy_splice_read(). With this, ITER_PIPE is no longer used. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Steve French <smfrench@gmail.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-30-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
b85930a0 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Make splice from a DAX file use copy_splice_read() Make a read splice from a DAX file go directly to copy_splice_read() to do the reading as filemap_splice_read() is unlikely to find any pagecache to splice. I think this affects only erofs, Ext2, Ext4, fuse and XFS. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: linux-erofs@lists.ozlabs.org cc: linux-ext4@vger.kernel.org cc: linux-xfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-9-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
aa3dbde8 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Make splice from an O_DIRECT fd use copy_splice_read() Make a read splice from a file descriptor that's open O_DIRECT use copy_splice_read() to do the reading as filemap_splice_read() is unlikely to find any pagecache to splice. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-8-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
123856f0 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Check for zero count in vfs_splice_read() Make vfs_splice_read() return immediately if the length is 0. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-7-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
6a3f30b8 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Make do_splice_to() generic and export it Rename do_splice_to() to vfs_splice_read() and export it so that it can be used as a helper when calling down to a lower layer filesystem as it performs all the necessary checks[1]. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Miklos Szeredi <miklos@szeredi.hu> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: John Hubbard <jhubbard@nvidia.com> cc: David Hildenbrand <david@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-unionfs@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/CAJfpeguGksS3sCigmRi9hJdUec8qtM9f+_9jC1rJhsXT+dV01w@mail.gmail.com/ [1] Link: https://lore.kernel.org/r/20230522135018.2742245-6-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
e69f37bc |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Clean up copy_splice_read() a bit Do a couple of cleanups to copy_splice_read(): (1) Cast to struct page **, not void *. (2) Simplify the calculation of the number of pages to keep/reclaim in copy_splice_read(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-5-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
69df79a4 |
|
22-May-2023 |
David Howells <dhowells@redhat.com> |
splice: Rename direct_splice_read() to copy_splice_read() Rename direct_splice_read() to copy_splice_read() to better reflect as to what it does. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Steve French <sfrench@samba.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-4-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0f99fc51 |
|
24-Apr-2023 |
Jens Axboe <axboe@kernel.dk> |
splice: clear FMODE_NOWAIT on file if splice/vmsplice is used In preparation for pipes setting FMODE_NOWAIT on pipes to indicate that RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice clear that file flag. Splice holds the pipe lock around IO and cannot easily be refactored to avoid that, as splice and pipes are inherently tied together. By clearing FMODE_NOWAIT if splice is being used on a pipe, other users of the pipe will know that the pipe is no longer safe for RWF_NOWAIT and friends. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
983652c6 |
|
22-Mar-2023 |
Chung-Chiang Cheng <cccheng@synology.com> |
splice: report related fsnotify events The fsnotify ACCESS and MODIFY event are missing when manipulating a file with splice(2). Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Jan Kara <jack@suse.cz> Message-Id: <20230322062519.409752-1-cccheng@synology.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
c3a4aec0 |
|
07-Mar-2023 |
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> |
splice: Remove redundant assignment to ret The variable ret belongs to redundant assignment and can be deleted. fs/splice.c:940:2: warning: Value stored to 'ret' is never read. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4406 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
7c8e01eb |
|
15-Feb-2023 |
David Howells <dhowells@redhat.com> |
splice: Export filemap/direct_splice_read() filemap_splice_read() and direct_splice_read() should be exported. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
|
#
33b3b041 |
|
07-Feb-2023 |
David Howells <dhowells@redhat.com> |
splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE Implement a function, direct_file_splice(), that deals with this by using an ITER_BVEC iterator instead of an ITER_PIPE iterator as the former won't free its buffers when reverted. The function bulk allocates all the buffers it thinks it is going to use in advance, does the read synchronously and only then trims the buffer down. The pages we did use get pushed into the pipe. This fixes a problem with the upcoming iov_iter_extract_pages() function, whereby pages extracted from a non-user-backed iterator such as ITER_PIPE aren't pinned. __iomap_dio_rw(), however, calls iov_iter_revert() to shorten the iterator to just the bufferage it is going to use - which has the side-effect of freeing the excess pipe buffers, even though they're attached to a bio and may get written to by DMA (thanks to Hillf Danton for spotting this[1]). This then causes memory corruption that is particularly noticeable when the syzbot test[2] is run. The test boils down to: out = creat(argv[1], 0666); ftruncate(out, 0x800); lseek(out, 0x200, SEEK_SET); in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW); sendfile(out, in, NULL, 0x1dd00); run repeatedly in parallel. What I think is happening is that ftruncate() occasionally shortens the DIO read that's about to be made by sendfile's splice core by reducing i_size. This should be more efficient for DIO read by virtue of doing a bulk page allocation, but slightly less efficient by ignoring any partial page in the pipe. Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1] Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2] Signed-off-by: Steve French <stfrench@microsoft.com>
|
#
664e4078 |
|
03-Feb-2023 |
Christoph Hellwig <hch@lst.de> |
splice: use bvec_set_page to initialize a bvec Use the bvec_set_page helper to initialize a bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230203150634.3199647-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
de4eda9d |
|
15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7d690c15 |
|
09-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() ... and untangle the cleanup on failure to add into pipe. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0d964934 |
|
12-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: stop abusing iov_iter_advance() to flush a pipe Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother with rather non-obvious use of iov_iter_advance() in there. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
97ef77c5 |
|
29-Jun-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
fs: check FMODE_LSEEK to control internal pipe splicing The original direct splicing mechanism from Jens required the input to be a regular file because it was avoiding the special socket case. It also recognized blkdevs as being close enough to a regular file. But it forgot about chardevs, which behave the same way and work fine here. This is an okayish heuristic, but it doesn't totally work. For example, a few chardevs should be spliceable here. And a few regular files shouldn't. This patch fixes this by instead checking whether FMODE_LSEEK is set, which represents decently enough what we need rewinding for when splicing to internal pipes. Fixes: b92ce5589374 ("[PATCH] splice: add direct fd <-> fd splicing support") Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5100da38 |
|
12-Feb-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: Convert remove_mapping() to take a folio Add kernel-doc and return the number of pages removed in order to get the statistics right in __invalidate_mapping_pages(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
|
#
b9ccad2e |
|
11-Feb-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
splice: Use a folio in page_cache_pipe_buf_try_steal() This saves a lot of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
|
#
b964bf53 |
|
25-Jan-2021 |
Al Viro <viro@zeniv.linux.org.uk> |
teach sendfile(2) to handle send-to-pipe directly no point going through the intermediate pipe Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
faa97c48 |
|
25-Jan-2021 |
Al Viro <viro@zeniv.linux.org.uk> |
take the guts of file-to-pipe splice into a helper function Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
313d64a3 |
|
24-Jan-2021 |
Al Viro <viro@zeniv.linux.org.uk> |
do_splice_to(): move the logics for limiting the read length in Both callers have the identical logics limiting the amount of data we try to read into pipe - no more than would fit into that pipe. Move that into do_splice_to() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0f1d344f |
|
09-Jan-2021 |
Pavel Begunkov <asml.silence@gmail.com> |
splice: don't generate zero-len segement bvecs iter_file_splice_write() may spawn bvec segments with zero-length. In preparation for prohibiting them, filter out by hand at splice level. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
ee6e00c8 |
|
22-Oct-2020 |
Jens Axboe <axboe@kernel.dk> |
splice: change exported internal do_splice() helper to take kernel offset With the set_fs change, we can no longer rely on copy_{to,from}_user() accepting a kernel pointer, and it was bad form to do so anyway. Clean this up and change the internal helper that io_uring uses to deal with kernel pointers instead. This puts the offset copy in/out in __do_splice() instead, which just calls the same helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
d1a819a2 |
|
05-Oct-2020 |
Linus Torvalds <torvalds@linux-foundation.org> |
splice: teach splice pipe reading about empty pipe buffers Tetsuo Handa reports that splice() can return 0 before the real EOF, if the data in the splice source pipe is an empty pipe buffer. That empty pipe buffer case doesn't happen in any normal situation, but you can trigger it by doing a write to a pipe that fails due to a page fault. Tetsuo has a test-case to show the behavior: #define _GNU_SOURCE #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char *argv[]) { const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600); int pipe_fd[2] = { -1, -1 }; pipe(pipe_fd); write(pipe_fd[1], NULL, 4096); /* This splice() should wait unless interrupted. */ return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0); } which results in write(5, NULL, 4096) = -1 EFAULT (Bad address) splice(4, NULL, 3, NULL, 65536, 0) = 0 and this can confuse splice() users into believing they have hit EOF prematurely. The issue was introduced when the pipe write code started pre-allocating the pipe buffers before copying data from user space. This is modified verion of Tetsuo's original patch. Fixes: a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot") Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/ Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
598b3cec |
|
24-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
fs: remove compat_sys_vmsplice Now that import_iovec handles compat iovecs, the native vmsplice syscall can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
89cd35c5 |
|
24-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
iov_iter: transparently handle compat iovecs in import_iovec Use in compat_syscall to import either native or the compat iovecs, and remove the now superflous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
472e5b05 |
|
01-Oct-2020 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: remove pipe_wait() and fix wakeup race with splice The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to happen, which depended on all pipe IO being entirely serialized by the pipe lock. So by checking the state you were waiting for, and then adding yourself to the wait queue before dropping the lock, you were guaranteed to see all the wakeups. Strictly speaking, the actual wakeups were not done under the lock, but the pipe_wait() model still worked, because since the waiter held the lock when checking whether it should sleep, it would always see the current state, and the wakeup was always done after updating the state. However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") split the single wait-queue into two, and in the process also made the "wait for event" code wait for _two_ wait queues, and that then showed a race with the wakers that were not serialized by the pipe lock. It's only splice that used that "pipe_wait()" model, so the problem wasn't obvious, but Josef Bacik reports: "I hit a hang with fstest btrfs/187, which does a btrfs send into /dev/null. This works by creating a pipe, the write side is given to the kernel to write into, and the read side is handed to a thread that splices into a file, in this case /dev/null. The box that was hung had the write side stuck here [pipe_write] and the read side stuck here [splice_from_pipe_next -> pipe_wait]. [ more details about pipe_wait() scenario ] The problem is we're doing the prepare_to_wait, which sets our state each time, however we can be woken up either with reads or writes. In the case above we race with the WRITER waking us up, and re-set our state to INTERRUPTIBLE, and thus never break out of schedule" Josef had a patch that avoided the issue in pipe_wait() by just making it set the state only once, but the deeper problem is that pipe_wait() depends on a level of synchonization by the pipe mutex that it really shouldn't. And the whole "wait for any pipe state change" model really isn't very good to begin with. So rather than trying to work around things in pipe_wait(), remove that legacy model of "wait for arbitrary pipe event" entirely, and actually create functions that wait for the pipe actually being readable or writable, and can do so without depending on the pipe lock serializing everything. Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/ Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
36e2c742 |
|
03-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
fs: don't allow splice read/write without explicit ops default_file_splice_write is the last piece of generic code that uses set_fs to make the uaccess routines operate on kernel pointers. It implements a "fallback loop" for splicing from files that do not actually provide a proper splice_read method. The usual file systems and other high bandwidth instances all provide a ->splice_read, so this just removes support for various device drivers and procfs/debugfs files. If splice support for any of those turns out to be important it can be added back by switching them to the iter ops and using generic_file_splice_read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
566d1362 |
|
19-May-2020 |
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> |
pipe: Fix pipe_full() test in opipe_prep(). syzbot is reporting that splice()ing from non-empty read side to already-full write side causes unkillable task, for opipe_prep() is by error not inverting pipe_full() test. CPU: 0 PID: 9460 Comm: syz-executor.5 Not tainted 5.6.0-rc3-next-20200228-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:rol32 include/linux/bitops.h:105 [inline] RIP: 0010:iterate_chain_key kernel/locking/lockdep.c:369 [inline] RIP: 0010:__lock_acquire+0x6a3/0x5270 kernel/locking/lockdep.c:4178 Call Trace: lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4720 __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x156/0x13c0 kernel/locking/mutex.c:1103 pipe_lock_nested fs/pipe.c:66 [inline] pipe_double_lock+0x1a0/0x1e0 fs/pipe.c:104 splice_pipe_to_pipe fs/splice.c:1562 [inline] do_splice+0x35f/0x1520 fs/splice.c:1141 __do_sys_splice fs/splice.c:1447 [inline] __se_sys_splice fs/splice.c:1427 [inline] __x64_sys_splice+0x2b5/0x320 fs/splice.c:1427 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: syzbot+b48daca8639150bc5e73@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=9386d051e11e09973d5a4cf79af5e8cedf79386d Fixes: 8cefc107ca54c8b0 ("pipe: Use head and tail pointers for the ring, not cursor and length") Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c928f642 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: rename pipe_buf ->steal to ->try_steal And replace the arcane return value convention with a simple bool where true means success and false means failure. [AV: braino fix folded in] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b8d9e7f2 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: make the pipe_buf_operations ->confirm operation optional Just return 0 for success if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
76887c25 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: make the pipe_buf_operations ->steal operation optional Just return 1 for failure if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
f6dd9755 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
pipe: merge anon_pipe_buf*_ops All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new one to allow for merging. Inverting it vs the previous nomerge special casing actually allows for much nicer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
00c285d0 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: simplify do_splice_from No need for a local function pointer when we can trivial branch on the ->splice_write presence. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2bc01060 |
|
20-May-2020 |
Christoph Hellwig <hch@lst.de> |
fs: simplify do_splice_to No need for a local function pointer when we can trivial branch on the ->splice_read presence. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c73be61c |
|
14-Jan-2020 |
David Howells <dhowells@redhat.com> |
pipe: Add general notification queue support Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect. The way the notification queue is used is: (1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once): pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); (2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE. Notification messages include a length in the header so that the caller can split them up. Each message has a header that describes it: struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; }; The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags. Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
9dafdfc2 |
|
17-May-2020 |
Pavel Begunkov <asml.silence@gmail.com> |
splice: export do_tee() export do_tee() for use in io_uring Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
90da2e3f |
|
04-May-2020 |
Pavel Begunkov <asml.silence@gmail.com> |
splice: move f_mode checks to do_{splice,tee}() do_splice() is used by io_uring, as will be do_tee(). Move f_mode checks from sys_{splice,tee}() to do_{splice,tee}(), so they're enforced for io_uring as well. Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
444ebb57 |
|
24-Feb-2020 |
Pavel Begunkov <asml.silence@gmail.com> |
splice: make do_splice public Make do_splice(), so other kernel parts can reuse it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0ddad21d |
|
09-Dec-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: use exclusive waits when reading or writing This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space). While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication. In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe. This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token. A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it): #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader. But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows 231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches 14.621273891 seconds time elapsed 0.018243000 seconds user 3.611468000 seconds sys before this commit. After this commit, I get 5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches 1.702148350 seconds time elapsed 0.004868000 seconds user 0.110786000 seconds sys instead. Much better. [ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs."). But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ] Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver. All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug" Tested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a28c8b9d |
|
07-Dec-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: remove 'waiting_writers' merging logic This code is ancient, and goes back to when we only had a single page for the pipe buffers. The exact history is hidden in the mists of time (ie "before git", and in fact predates the BK repository too). At that long-ago point in time, it actually helped to try to merge big back-and-forth pipe reads and writes, and not limit pipe reads to the single pipe buffer in length just because that was all we had at a time. However, since then we've expanded the pipe buffers to multiple pages, and this logic really doesn't seem to make sense. And a lot of it is somewhat questionable (ie "hmm, the user asked for a non-blocking read, but we see that there's a writer pending, so let's wait anyway to get the extra data that the writer will have"). But more importantly, it makes the "go to sleep" logic much less obvious, and considering the wakeup issues we've had, I want to make for less of those kinds of things. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ec057595 |
|
06-Dec-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
pipe: fix incorrect caching of pipe state over pipe_wait() Similarly to commit 8f868d68d335 ("pipe: Fix missing mask update after pipe_wait()") this fixes a case where the pipe rewrite ended up caching the pipe state incorrectly over a pipe lock drop event. It wasn't quite as obvious, because you needed to splice data from a pipe to a file, which is a fairly unusual operation, but it's completely wrong. Make sure we load the pipe head/tail/size information only after we've waited for there to be data in the pipe. While in that file, also make one of the splice helper functions use the canonical arghument order for pipe_empty(). That's syntactic - pipe emptiness is just that head and tail are equal, and thus mixing up head and tail doesn't really matter. It's still wrong, though. Reported-by: David Sterba <dsterba@suse.cz> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6718b6f8 |
|
16-Oct-2019 |
David Howells <dhowells@redhat.com> |
pipe: Allow pipes to have kernel-reserved slots Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
8cefc107 |
|
15-Nov-2019 |
David Howells <dhowells@redhat.com> |
pipe: Use head and tail pointers for the ring, not cursor and length Convert pipes to use head and tail pointers for the buffer ring rather than pointer and length as the latter requires two atomic ops to update (or a combined op) whereas the former only requires one. (1) The head pointer is the point at which production occurs and points to the slot in which the next buffer will be placed. This is equivalent to pipe->curbuf + pipe->nrbufs. The head pointer belongs to the write-side. (2) The tail pointer is the point at which consumption occurs. It points to the next slot to be consumed. This is equivalent to pipe->curbuf. The tail pointer belongs to the read-side. (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They are only masked off when the array is being accessed, e.g.: pipe->bufs[head & mask] This means that it is not necessary to have a dead slot in the ring as head == tail isn't ambiguous. (4) The ring is empty if "head == tail". A helper, pipe_empty(), is provided for this. (5) The occupancy of the ring is "head - tail". A helper, pipe_occupancy(), is provided for this. (6) The number of free slots in the ring is "pipe->ring_size - occupancy". A helper, pipe_space_for_user() is provided to indicate how many slots userspace may use. (7) The ring is full if "head - tail >= pipe->ring_size". A helper, pipe_full(), is provided for this. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
3253d9d0 |
|
15-Oct-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
splice: only read in as much information as there is pipe buffer space Andreas Grünbacher reports that on the two filesystems that support iomap directio, it's possible for splice() to return -EAGAIN (instead of a short splice) if the pipe being written to has less space available in its pipe buffers than the length supplied by the calling process. Months ago we fixed splice_direct_to_actor to clamp the length of the read request to the size of the splice pipe. Do the same to do_splice. Fixes: 17614445576b6 ("splice: don't read more than available pipe space") Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com Reported-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com> Reviewed-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
87e5e6da |
|
14-May-2019 |
Jens Axboe <axboe@kernel.dk> |
uio: make import_iovec()/compat_import_iovec() return bytes on success Currently these functions return < 0 on error, and 0 for success. Change that so that we return < 0 on error, but number of bytes for success. Some callers already treat the return value that way, others need a slight tweak. Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
457c8996 |
|
19-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
b9872226 |
|
04-Apr-2019 |
Jann Horn <jannh@google.com> |
tracing: Fix buffer_ref pipe ops This fixes multiple issues in buffer_pipe_buf_ops: - The ->steal() handler must not return zero unless the pipe buffer has the only reference to the page. But generic_pipe_buf_steal() assumes that every reference to the pipe is tracked by the page's refcount, which isn't true for these buffers - buffer_pipe_buf_get(), which duplicates a buffer, doesn't touch the page's refcount. Fix it by using generic_pipe_buf_nosteal(), which refuses every attempted theft. It should be easy to actually support ->steal, but the only current users of pipe_buf_steal() are the virtio console and FUSE, and they also only use it as an optimization. So it's probably not worth the effort. - The ->get() and ->release() handlers can be invoked concurrently on pipe buffers backed by the same struct buffer_ref. Make them safe against concurrency by using refcount_t. - The pointers stored in ->private were only zeroed out when the last reference to the buffer_ref was dropped. As far as I know, this shouldn't be necessary anyway, but if we do it, let's always do it. Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
#
15fab63e |
|
05-Apr-2019 |
Matthew Wilcox <willy@infradead.org> |
fs: prevent page refcount overflow in pipe_buf_get Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
ee5e0011 |
|
07-Feb-2019 |
Slavomir Kaslev <kaslevs@vmware.com> |
fs: Make splice() and tee() take into account O_NONBLOCK flag on pipes The current implementation of splice() and tee() ignores O_NONBLOCK set on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for blocking on pipe arguments. This is inconsistent since splice()-ing from/to non-pipe file descriptors does take O_NONBLOCK into consideration. Fix this by promoting O_NONBLOCK, when set on a pipe, to SPLICE_F_NONBLOCK. Some context for how the current implementation of splice() leads to inconsistent behavior. In the ongoing work[1] to add VM tracing capability to trace-cmd we stream tracing data over named FIFOs or vsockets from guests back to the host. When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on the input file descriptor and set SPLICE_F_NONBLOCK for the next call to splice(). If splice() was blocked waiting on data from the input FIFO, after SIGINT splice() restarts with the same arguments (no SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no data is available. This differs from the splice() behavior when reading from a vsocket or when we're doing a traditional read()/write() loop (trace-cmd's --nosplice argument). With this patch applied we get the same behavior in all situations after setting O_NONBLOCK which also matches the behavior of doing a read()/write() loop instead of splice(). This change does have potential of breaking users who don't expect EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2]. [1] https://github.com/skaslev/trace-cmd/tree/vsock [2] https://github.com/torvalds/linux/blob/d47e3da1759230e394096fd742aad423c291ba48/fs/read_write.c#L1425 Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
736706be |
|
04-Mar-2019 |
Linus Torvalds <torvalds@linux-foundation.org> |
get rid of legacy 'get_ds()' function Every in-kernel use of this function defined it to KERNEL_DS (either as an actual define, or as an inline function). It's an entirely historical artifact, and long long long ago used to actually read the segment selector valueof '%ds' on x86. Which in the kernel is always KERNEL_DS. Inspired by a patch from Jann Horn that just did this for a very small subset of users (the ones in fs/), along with Al who suggested a script. I then just took it to the logical extreme and removed all the remaining gunk. Roughly scripted with git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/' git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d' plus manual fixups to remove a few unusual usage patterns, the couple of inline function cases and to fix up a comment that had become stale. The 'get_ds()' function remains in an x86 kvm selftest, since in user space it actually does something relevant. Inspired-by: Jann Horn <jannh@google.com> Inspired-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
01e7187b |
|
23-Jan-2019 |
Jann Horn <jannh@google.com> |
pipe: stop using ->can_merge Al Viro pointed out that since there is only one pipe buffer type to which new data can be appended, it isn't necessary to have a ->can_merge field in struct pipe_buf_operations, we can just check for a magic type. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a0ce2f0a |
|
23-Jan-2019 |
Jann Horn <jannh@google.com> |
splice: don't merge into linked buffers Before this patch, it was possible for two pipes to affect each other after data had been transferred between them with tee(): ============ $ cat tee_test.c int main(void) { int pipe_a[2]; if (pipe(pipe_a)) err(1, "pipe"); int pipe_b[2]; if (pipe(pipe_b)) err(1, "pipe"); if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write"); if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee"); if (write(pipe_b[1], "xx", 2) != 2) err(1, "write"); char buf[5]; if (read(pipe_a[0], buf, 4) != 4) err(1, "read"); buf[4] = 0; printf("got back: '%s'\n", buf); } $ gcc -o tee_test tee_test.c $ ./tee_test got back: 'abxx' $ ============ As suggested by Al Viro, fix it by creating a separate type for non-mergeable pipe buffers, then changing the types of buffers in splice_pipe_to_pipe() and link_pipe(). Cc: <stable@vger.kernel.org> Fixes: 7c77f0b3f920 ("splice: implement pipe to pipe splicing") Fixes: 70524490ee2e ("[PATCH] splice: add support for sys_tee()") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
17614445 |
|
30-Nov-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
splice: don't read more than available pipe space In commit 4721a601099, we tried to fix a problem wherein directio reads into a splice pipe will bounce EFAULT/EAGAIN all the way out to userspace by simulating a zero-byte short read. This happens because some directio read implementations (xfs) will call bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous reads, but as soon as we run out of pipe buffers that _get_pages call returns EFAULT, which the splice code translates to EAGAIN and bounces out to userspace. In that commit, the iomap code catches the EFAULT and simulates a zero-byte read, but that causes assertion errors on regular splice reads because xfs doesn't allow short directio reads. The brokenness is compounded by splice_direct_to_actor immediately bailing on do_splice_to returning <= 0 without ever calling ->actor (which empties out the pipe), so if userspace calls back we'll EFAULT again on the full pipe, and nothing ever gets copied. Therefore, teach splice_direct_to_actor to clamp its requests to the amount of free space in the pipe and remove the simulated short read from the iomap directio code. Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Ranted-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
aa563d7b |
|
19-Oct-2018 |
David Howells <dhowells@redhat.com> |
iov_iter: Separate type from direction and use accessor functions In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
6da2ec56 |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
87a3002a |
|
26-May-2018 |
Al Viro <viro@zeniv.linux.org.uk> |
vmsplice(): lift importing iovec into vmsplice(2) and compat counterpart ... getting rid of transformations in the latter - just use compat_import_iovec(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
30cfe4ef |
|
17-Mar-2018 |
Dominik Brodowski <linux@dominikbrodowski.net> |
fs: add do_vmsplice() helper; remove in-kernel call to syscall Using the fs-internal do_vmsplice() helper allows us to get rid of the fs-internal call to the sys_vmsplice() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
|
#
6aa7de05 |
|
23-Oct-2017 |
Mark Rutland <mark.rutland@arm.com> |
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
ac452aca |
|
01-Sep-2017 |
Christoph Hellwig <hch@lst.de> |
fs: move kernel_write to fs/read_write.c Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
abbb6589 |
|
27-May-2017 |
Christoph Hellwig <hch@lst.de> |
fs: implement vfs_iter_write using do_iter_write De-dupliate some code and allow for passing the flags argument to vfs_iter_write. Additionally it now properly updates timestamps. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
bb7462b6 |
|
20-Feb-2017 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: use helpers for calling f_op->{read,write}_iter() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
5a81e6a1 |
|
16-Feb-2017 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: fix uninitialized flags in splice_to_pipe() Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the unused part of the pipe ring buffer. Previously splice_to_pipe() left the flags value alone, which could result in incorrect behavior. Uninitialized flags appears to have been there from the introduction of the splice syscall. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # 2.6.17+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
13c0f52b |
|
10-Dec-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
make nr_pages calculation in default_file_splice_read() a bit less ugly It's an artifact of lousy calling conventions of iov_iter_get_pages_alloc(). Hopefully, we'll get something saner come next cycle; for now that'll do. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
3d6ea290 |
|
10-Dec-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
splice/tee/vmsplice: validate flags Long overdue... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
23c832b1 |
|
12-Oct-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
remove spd_release_page() no users left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
52bce911 |
|
21-Dec-2016 |
Linus Torvalds <torvalds@linux-foundation.org> |
splice: reinstate SIGPIPE/EPIPE handling Commit 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()") caused a regression when there were no more readers left on a pipe that was being spliced into: rather than the expected SIGPIPE and -EPIPE return value, the writer would end up waiting forever for space to free up (which obviously was not going to happen with no readers around). Fixes: 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()") Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org> Debugged-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org # v4.9 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8e54cada |
|
26-Nov-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fix default_file_splice_read() Botched calculation of number of pages. As the result, we were dropping pieces when doing splice to pipe from e.g. 9p. Reported-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e519e777 |
|
10-Nov-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: remove detritus from generic_file_splice_read() i_size check is a leftover from the horrors that used to play with the page cache in that function. With the switch to ->read_iter(), it's neither needed nor correct - for gfs2 it ends up being buggy, since i_size is not guaranteed to be correct until later (inside ->read_iter()). Spotted-by: Abhi Das <adas@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
be297968 |
|
01-Nov-2016 |
Christoph Hellwig <hch@lst.de> |
mm: only include blk_types in swap.h if CONFIG_SWAP is enabled It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
c3a69024 |
|
10-Oct-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
fix ITER_PIPE interaction with direct_IO by making sure we call iov_iter_advance() on original iov_iter even if direct_IO (done on its copy) has returned 0. It's a no-op for old iov_iter flavours and does the right thing (== truncation of the stuff we'd allocated, but not filled) in ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with by cleanup in generic_file_read_iter(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fba597db |
|
27-Sep-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
pipe: add pipe_buf_confirm() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a779638c |
|
27-Sep-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
pipe: add pipe_buf_release() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7bf2d1df |
|
27-Sep-2016 |
Miklos Szeredi <mszeredi@redhat.com> |
pipe: add pipe_buf_get() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
523ac9af |
|
23-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
switch default_file_splice_read() to use of pipe-backed iov_iter we only use iov_iter_get_pages_alloc() and iov_iter_advance() - pages are filled by kernel_readv() via a kvec array (as we used to do all along), so iov_iter here is used only as a way of arranging for those pages to be in pipe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
82c156f8 |
|
22-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
switch generic_file_splice_read() to use of ->read_iter() ... and kill the ->splice_read() instances that can be switched to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
241699cd |
|
22-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
new iov_iter flavour: pipe-backed iov_iter variant for passing data into pipe. copy_to_iter() copies data into page(s) it has allocated and stuffs them into the pipe; copy_page_to_iter() stuffs there a reference to the page given to it. Both will try to coalesce if possible. iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages() and friends will do as copy_to_iter() would have and return the pages where the data would've been copied. iov_iter_advance() will truncate everything past the spot it has advanced to. New primitive: iov_iter_pipe(), used for initializing those. pipe should be locked all along. Running out of space acts as fault would for iovec-backed ones; in other words, giving it to ->read_iter() may result in short read if the pipe overflows, or -EFAULT if it happens with nothing copied there. In other words, ->read_iter() on those acts pretty much like ->splice_read(). Moreover, all generic_file_splice_read() users, as well as many other ->splice_read() instances can be switched to that scheme - that'll happen in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
79fddc4e |
|
17-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: add_to_pipe() single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched to that, leaving splice_to_pipe() only for ->splice_read() instances (and that only until they are converted as well). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8924feff |
|
17-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: lift pipe_lock out of splice_to_pipe() * splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock * ->splice_read() instances do the same * vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe()) arrange for waiting, looping, etc. themselves. That should make pipe_lock the outermost one. Unfortunately, existing rules for the amount passed by vmsplice_to_pipe() and do_splice() are quite ugly _and_ userland code can be easily broken by changing those. It's not even "no more than the maximal capacity of this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe, leave instead of waiting". Considering how poorly these rules are documented, let's try "wait for some space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe and if we run into overflow, we are done". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
db85a9eb |
|
17-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: switch get_iovec_page_array() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e7c3c646 |
|
17-Sep-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
splice_to_pipe(): don't open-code wakeup_pipe_readers() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
09cbfeaf |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
03cc0789 |
|
02-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
do_splice_to(): cap the size before passing to ->splice_read() pipe capacity won't exceed 2G anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d6785d91 |
|
10-Mar-2016 |
Rabin Vincent <rabin@rab.in> |
splice: handle zero nr_pages in splice_to_pipe() Running the following command: busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null with any tracing enabled pretty very quickly leads to various NULL pointer dereferences and VM BUG_ON()s, such as these: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40 Call Trace: [<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0 [<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10 [<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0 [<ffffffff81196869>] do_sendfile+0x199/0x380 [<ffffffff81197600>] SyS_sendfile64+0x90/0xa0 [<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) kernel BUG at include/linux/mm.h:367! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40 Call Trace: [<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0 [<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10 [<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0 [<ffffffff81196869>] do_sendfile+0x199/0x380 [<ffffffff81197600>] SyS_sendfile64+0x90/0xa0 [<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89 (busybox's cat uses sendfile(2), unlike the coreutils version) This is because tracing_splice_read_pipe() can call splice_to_pipe() with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and we fill the page pointers and the other fields of the pipe_buffers with garbage. All other callers of splice_to_pipe() avoid calling it when nr_pages == 0, and we could make tracing_splice_read_pipe() do that too, but it seems reasonable to have splice_to_page() handle this condition gracefully. Cc: stable@vger.kernel.org Signed-off-by: Rabin Vincent <rabin@rab.in> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
793b80ef |
|
03-Mar-2016 |
Christoph Hellwig <hch@lst.de> |
vfs: pass a flags argument to vfs_readv/vfs_writev This way we can set kiocb flags also from the sync read/write path for the read_iter/write_iter operations. For now there is no way to pass flags to plain read/write operations as there is no real need for that, and all flags passed are explicitly rejected for these files. Signed-off-by: Milosz Tanski <milosz@adfin.com> [hch: rebased on top of my kiocb changes] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stephen Bates <stephen.bates@pmcs.com> Tested-by: Stephen Bates <stephen.bates@pmcs.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
90330e68 |
|
18-Dec-2015 |
Abhi Das <adas@redhat.com> |
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE During testing, I discovered that __generic_file_splice_read() returns 0 (EOF) when aops->readpage fails with AOP_TRUNCATED_PAGE on the first page of a single/multi-page splice read operation. This EOF return code causes the userspace test to (correctly) report a zero-length read error when it was expecting otherwise. The current strategy of returning a partial non-zero read when ->readpage returns AOP_TRUNCATED_PAGE works only when the failed page is not the first of the lot being processed. This patch attempts to retry lookup and call ->readpage again on pages that had previously failed with AOP_TRUNCATED_PAGE. With this patch, my tests pass and I haven't noticed any unwanted side effects. This version removes the thrice-retry loop and instead indefinitely retries lookups on AOP_TRUNCATED_PAGE errors from ->readpage. This behavior is now similar to do_generic_file_read(). Signed-off-by: Abhi Das <adas@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c2489e07 |
|
23-Nov-2015 |
Jan Kara <jack@suse.cz> |
vfs: Avoid softlockups with sendfile(2) The following test program from Dmitry can cause softlockups or RCU stalls as it copies 1GB from tmpfs into eventfd and we don't have any scheduling point at that path in sendfile(2) implementation: int r1 = eventfd(0, 0); int r2 = memfd_create("", 0); unsigned long n = 1<<30; fallocate(r2, 0, 0, n); sendfile(r1, r2, 0, n); Add cond_resched() into __splice_from_pipe() to fix the problem. CC: Dmitry Vyukov <dvyukov@google.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c725bfce |
|
23-Nov-2015 |
Jan Kara <jack@suse.cz> |
vfs: Make sendfile(2) killable even better Commit 296291cdd162 (mm: make sendfile(2) killable) fixed an issue where sendfile(2) was doing a lot of tiny writes into a filesystem and thus was unkillable for a long time. However sendfile(2) can be (mis)used to issue lots of writes into arbitrary file descriptor such as evenfd or similar special file descriptors which never hit the standard filesystem write path and thus are still unkillable. E.g. the following example from Dmitry burns CPU for ~16s on my test system without possibility to be killed: int r1 = eventfd(0, 0); int r2 = memfd_create("", 0); unsigned long n = 1<<30; fallocate(r2, 0, 0, n); sendfile(r1, r2, 0, n); There are actually quite a few tests for pending signals in sendfile code however we data to write is always available none of them seems to trigger. So fix the problem by adding a test for pending signal into splice_from_pipe_next() also before the loop waiting for pipe buffers to be available. This should fix all the lockup issues with sendfile of the do-ton-of-tiny-writes nature. CC: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c62d2555 |
|
06-Nov-2015 |
Michal Hocko <mhocko@suse.com> |
mm, fs: introduce mapping_gfp_constraint() There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6afdb859 |
|
24-Jun-2015 |
Michal Hocko <mhocko@suse.cz> |
mm: do not ignore mapping_gfp_mask in page cache allocation paths page_cache_read, do_generic_file_read, __generic_file_splice_read and __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling add_to_page_cache_lru which might cause recursion into fs down in the direct reclaim path if the mapping really relies on GFP_NOFS semantic. This doesn't seem to be the case now because page_cache_read (page fault path) doesn't seem to suffer from the reclaim recursion issues and do_generic_file_read and __generic_file_splice_read also shouldn't be called under fs locks which would deadlock in the reclaim path. Anyway it is better to obey mapping gfp mask and prevent from later breakage. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2b514574 |
|
21-May-2015 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
net: af_unix: implement splice for stream af_unix sockets unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is inneglible, we mostly have to deal with a non-existing struct msghdr argument. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ff28d9f |
|
06-May-2015 |
Christophe Leroy <christophe.leroy@c-s.fr> |
splice: sendfile() at once fails for big files Using sendfile with below small program to get MD5 sums of some files, it appear that big files (over 64kbytes with 4k pages system) get a wrong MD5 sum while small files get the correct sum. This program uses sendfile() to send a file to an AF_ALG socket for hashing. /* md5sum2.c */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <fcntl.h> #include <sys/socket.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/if_alg.h> int main(int argc, char **argv) { int sk = socket(AF_ALG, SOCK_SEQPACKET, 0); struct stat st; struct sockaddr_alg sa = { .salg_family = AF_ALG, .salg_type = "hash", .salg_name = "md5", }; int n; bind(sk, (struct sockaddr*)&sa, sizeof(sa)); for (n = 1; n < argc; n++) { int size; int offset = 0; char buf[4096]; int fd; int sko; int i; fd = open(argv[n], O_RDONLY); sko = accept(sk, NULL, 0); fstat(fd, &st); size = st.st_size; sendfile(sko, fd, &offset, size); size = read(sko, buf, sizeof(buf)); for (i = 0; i < size; i++) printf("%2.2x", buf[i]); printf(" %s\n", argv[n]); close(fd); close(sko); } exit(0); } Test below is done using official linux patch files. First result is with a software based md5sum. Second result is with the program above. root@vgoip:~# ls -l patch-3.6.* -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz 5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz After investivation, it appears that sendfile() sends the files by blocks of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each block, the SPLICE_F_MORE flag is missing, therefore the hashing operation is reset as if it was the end of the file. This patch adds SPLICE_F_MORE to the flags when more data is pending. With the patch applied, we get the correct sums: root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
be64f884 |
|
15-Apr-2015 |
Boaz Harrosh <boaz@plexistor.com> |
dax: unify ext2/4_{dax,}_file_operations The original dax patchset split the ext2/4_file_operations because of the two NULL splice_read/splice_write in the dax case. In the vfs if splice_read/splice_write are NULL we then call default_splice_read/write. What we do here is make generic_file_splice_read aware of IS_DAX() so the original ext2/4_file_operations can be used as is. For write it appears that iter_file_splice_write is just fine. It uses the regular f_op->write(file,..) or new_sync_write(file, ...). Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
345995fa |
|
21-Mar-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
vmsplice_to_user(): switch to import_iovec() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e2e40f2c |
|
22-Feb-2015 |
Christoph Hellwig <hch@lst.de> |
fs: move struct kiocb to fs.h struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
dbe4e192 |
|
25-Jan-2015 |
Christoph Hellwig <hch@lst.de> |
fs: add vfs_iter_{read,write} helpers Simple helpers that pass an arbitrary iov_iter to filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
05afcb77 |
|
22-Jan-2015 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: iov_iter_bvec() similar to iov_iter_kvec(), for ITER_BVEC ones Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1c118596 |
|
23-Oct-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: export do_splice_direct() to modules Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
#
5f073850 |
|
05-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
kill generic_file_splice_write() no callers left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
96f9bc8f |
|
05-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
fs/splice.c: remove unneeded exports ocfs2 was using a bunch of splice.c guts... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8d020765 |
|
05-Apr-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
->splice_write() via ->write_iter() iter_file_splice_write() - a ->splice_write() instance that gathers the pipe buffers, builds a bio_vec-based iov_iter covering those and feeds it to ->write_iter(). A bunch of simple cases coverted to that... [AV: fixed the braino spotted by Cyrill] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b6dd6f47 |
|
27-May-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: fix vmplice_to_user() Commit 6130f5315ee8 "switch vmsplice_to_user() to copy_page_to_iter()" in v3.15-rc1 broke vmsplice(2). This patch fixes two bugs: - count is not initialized to a proper value, which resulted in no data being copied - if rw_copy_check_uvector() returns negative then the iov might be leaked. Tested OK. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
71d8e532 |
|
05-Mar-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
start adding the tag to iov_iter For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6130f531 |
|
03-Feb-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
switch vmsplice_to_user() to copy_page_to_iter() I've switched the sanity checks on iovec to rw_copy_check_uvector(); we might need to do a local analog, if any behaviour differences are not actually bugfixes here... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fbb32750 |
|
02-Feb-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
pipe: kill ->map() and ->unmap() all pipe_buffer_operations have the same instances of those... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
28a625cb |
|
22-Jan-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
fuse: fix pipe_buf_operations Having this struct in module memory could Oops when if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
|
#
72c2d531 |
|
22-Sep-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
file->f_op is never NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
18c67cb9 |
|
19-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: lift checks from do_splice_from() into callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
50cd2c57 |
|
23-May-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift file_*_write out of do_splice_direct() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
500368f7 |
|
23-May-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift file_*_write out of do_splice_from() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
acdb37c3 |
|
22-Jun-2013 |
Randy Dunlap <rdunlap@infradead.org> |
fs: fix new splice.c kernel-doc warning Fix new kernel-doc warning in fs/splice.c: Warning(fs/splice.c:1298): No description found for parameter 'opos' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7995bd28 |
|
20-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
splice: don't pass the address of ->f_pos to methods Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7bee130e |
|
21-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of alloc_pipe_info() argument not used anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6447a3cf |
|
21-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of pipe->inode it's used only as a flag to distinguish normal pipes/FIFOs from the internal per-task one used by file-to-file splice. And pipe->files would work just as well for that purpose... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2dd8c9ad |
|
20-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift sb_start_write out of ->splice_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
17338fcc |
|
20-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift sb_start_write into default_file_splice_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
03d95eb2 |
|
20-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
lift sb_start_write() out of ->write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
06ae43f3 |
|
20-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
Don't bother with redoing rw_verify_area() from default_file_splice_from() default_file_splice_from() ends up calling vfs_write() (via very convoluted callchain). It's an overkill, since we already have done rw_verify_area() in the caller by the time we call vfs_write() we are under set_fs(KERNEL_DS), so access_ok() is also pointless. Add a new helper (__kernel_write()), use it instead of kernel_write() in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
76b021d0 |
|
02-Mar-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
convert vmsplice to COMPAT_SYSCALL_DEFINE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7bb307e8 |
|
23-Feb-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
export kernel_write(), convert open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
496ad9aa |
|
23-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ae62ca7b |
|
06-Jan-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: fix MSG_SENDPAGE_NOTLAST logic commit 35f9c09fe9c72e (tcp: tcp_sendpages() should call tcp_push() once) added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all frags but the last one for a splice() call. The condition used to set the flag in pipe_to_sendpage() relied on splice() user passing the exact number of bytes present in the pipe, or a smaller one. But some programs pass an arbitrary high value, and the test fails. The effect of this bug is a lack of tcp_push() at the end of a splice(pipe -> socket) call, and possibly very slow or erratic TCP sessions. We should both test sd->total_len and fact that another fragment is in the pipe (pipe->nrbufs > 1) Many thanks to Willy for providing very clear bug report, bisection and test programs. Reported-by: Willy Tarreau <w@1wt.eu> Bisected-by: Willy Tarreau <w@1wt.eu> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0e1d66b |
|
11-Dec-2012 |
Namjae Jeon <linkinjeon@gmail.com> |
writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr() There is no reason to pass the nr_pages_dirtied argument, because nr_pages_dirtied value from the caller is unused in balance_dirty_pages_ratelimited_nr(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2903ff01 |
|
27-Aug-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch simple cases of fget_light to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
14da9200 |
|
12-Jun-2012 |
Jan Kara <jack@suse.cz> |
fs: Protect write paths by sb_start_write - sb_end_write There are several entry points which dirty pages in a filesystem. mmap (handled by block_page_mkwrite()), buffered write (handled by __generic_file_aio_write()), splice write (generic_file_splice_write), truncate, and fallocate (these can dirty last partial page - handled inside each filesystem separately). Protect these places with sb_start_write() and sb_end_write(). ->page_mkwrite() calls are particularly complex since they are called with mmap_sem held and thus we cannot use standard sb_start_write() due to lock ordering constraints. We solve the problem by using a special freeze protection sb_start_pagefault() which ranks below mmap_sem. BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
047fe360 |
|
12-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
splice: fix racy pipe->buffers uses Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Cc: stable <stable@vger.kernel.org> # 2.6.35 Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
c3b2da31 |
|
26-Mar-2012 |
Josef Bacik <josef@redhat.com> |
fs: introduce inode operation ->update_time Btrfs has to make sure we have space to allocate new blocks in order to modify the inode, so updating time can fail. We've gotten around this by having our own file_update_time but this is kind of a pain, and Christoph has indicated he would like to make xfs do something different with atime updates. So introduce ->update_time, where we will deal with i_version an a/m/c time updates and indicate which changes need to be made. The normal version just does what it has always done, updates the time and marks the inode dirty, and then filesystems can choose to do something different. I've gone through all of the users of file_update_time and made them check for errors with the exception of the fault code since it's complicated and I wasn't quite sure what to do there, also Jan is going to be pushing the file time updates into page_mkwrite for those who have it so that should satisfy btrfs and make it not a big deal to check the file_update_time() return code in the generic fault path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
bd1a68b5 |
|
04-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
vmsplice: relax alignement requirements for SPLICE_F_GIFT It seems there is no fundamental reason to limit vmsplice() SPLICE_F_GIFT to page aligned chunks. All helpers are prepared to cope with offsets in page. This limitation makes vmsplice() API very impractical in the zero-copy land. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Changli Gao <xiaosuo@gmail.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
35f9c09f |
|
04-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_sendpages() should call tcp_push() once commit 2f533844242 (tcp: allow splice() to build full TSO packets) added a regression for splice() calls using SPLICE_F_MORE. We need to call tcp_flush() at the end of the last page processed in tcp_sendpages(), or else transmits can be deferred and future sends stall. Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but with different semantic. For all sendpage() providers, its a transparent change. Only sock_sendpage() and tcp_sendpages() can differentiate the two different flags provided by pipe_to_sendpage() Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8e3c3d6 |
|
25-Nov-2011 |
Cong Wang <amwang@redhat.com> |
fs: remove the second argument of k[un]map_atomic() Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Cong Wang <amwang@redhat.com>
|
#
630d9c47 |
|
16-Nov-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
fs: reduce the use of module.h wherever possible For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
ff01bb48 |
|
16-Sep-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
708e3508 |
|
25-Jul-2011 |
Hugh Dickins <hughd@google.com> |
tmpfs: clone shmem_file_splice_read() Copy __generic_file_splice_read() and generic_file_splice_read() from fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make page_cache_pipe_buf_ops and spd_release_page() accessible to it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
825cdcb1 |
|
23-May-2011 |
Namhyung Kim <namhyung@gmail.com> |
splice: add wakeup_pipe_readers() Add and use wakeup_pipe_readers() to consolidate duplicated codes. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
a8adbe37 |
|
17-Dec-2010 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors This patch pulls calls to buf->ops->confirm() from all actors passed (also indirectly) to splice_from_pipe_feed(). Is avoiding the call to buf->ops->confirm() while splice()ing to /dev/null is an intentional optimization? No other user does that and this will remove this special case. Against current linux.git 6313e3c21743cc88bb5bd8aa72948ee1e83937b6. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
c66fb347 |
|
28-Nov-2010 |
Linus Torvalds <torvalds@linux-foundation.org> |
Export 'get_pipe_info()' to other users And in particular, use it in 'pipe_fcntl()'. The other pipe functions do not need to use the 'careful' version, since they are only ever called for things that are already known to be pipes. The normal read/write/ioctl functions are called through the file operations structures, so if a file isn't a pipe, they'd never get called. But pipe_fcntl() is special, and called directly from the generic fcntl code, and needs to use the same careful function that the splice code is using. Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
71993e62 |
|
28-Nov-2010 |
Linus Torvalds <torvalds@linux-foundation.org> |
Rename 'pipe_info()' to 'get_pipe_info()' .. and change it to take the 'file' pointer instead of an inode, since that's what all users want anyway. The renaming is preparatory to exporting it to other users. The old 'pipe_info()' name was too generic and is already used elsewhere, so before making the function public we need to use a more specific name. Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6965031d |
|
02-Aug-2010 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: fix misuse of SPLICE_F_NONBLOCK SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the pipe. In __generic_file_splice_read(), however, it causes an EAGAIN if the page is currently being read. This makes it impossible to write an application that only wants failure if the pipe is full. For example if the same process is handling both ends of a pipe and isn't otherwise able to determine whether a splice to the pipe will fill it or not. We could make the read non-blocking on O_NONBLOCK or some other splice flag, but for now this is the simplest fix. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
1676effc |
|
21-Jun-2010 |
Andi Kleen <andi@firstfloor.org> |
gcc-4.6: fs: fix unused but set warnings No real bugs I believe, just some dead code, and some shut up code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
19c9a49b |
|
29-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
splice: check f_mode for seekable file check f_mode for seekable file As a seekable file is allowed without a llseek function, so the old way isn't work any more. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> ---- fs/splice.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
2cb4b05e |
|
29-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
splice: direct_splice_actor() should not use pos in sd direct_splice_actor() shouldn't use sd->pos, as sd->pos is for file reading, file->f_pos should be used instead. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> ---- fs/splice.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
0ae0b5d0 |
|
25-May-2010 |
Nick Piggin <npiggin@suse.de> |
fs/splice.c: fix mapping_gfp_mask usage mapping_gfp_mask() is not supposed to store allocation contex details, only page location details. So mapping_gfp_mask should be applied to the pagecache page allocation, wheras normal (kernel mapped) memory should be used for surrounding allocations such as radix-tree nodes allocated by add_to_page_cache. Context modifiers should be applied on a per-callsite basis. So change splice to follow this convention (which is followed in similar code patterns in core code). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
35f3d14d |
|
20-May-2010 |
Jens Axboe <jens.axboe@oracle.com> |
pipe: add support for shrinking and growing pipes This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for growing and shrinking the size of a pipe and adjusts pipe.c and splice.c (and relay and network splice) usage to work with these larger (or smaller) pipes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
cc56f7de |
|
04-Nov-2009 |
Changli Gao <xiaosuo@gmail.com> |
sendfile(): check f_op.splice_write() rather than f_op.sendpage() sendfile(2) was reworked with the splice infrastructure, but it still checks f_op.sendpage() instead of f_op.splice_write() wrongly. Although if f_op.sendpage() exists, f_op.splice_write() always exists at the same time currently, the assumption will be broken in future silently. This patch also brings a side effect: sendfile(2) can work with any output file. Some security checks related to f_op are added too. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
148f948b |
|
17-Aug-2009 |
Jan Kara <jack@suse.cz> |
vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
723590ed |
|
15-Aug-2009 |
Miklos Szeredi <mszeredi@suse.cz> |
splice: update mtime and atime on files Splice should update the modification and access times on regular files just like read and write. Not updating mtime will confuse backup tools, etc... This patch only adds the time updates for regular files. For pipes and other special files that splice touches the need for updating the times is less clear. Let's discuss and fix that separately. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
b2858d7d |
|
19-May-2009 |
Miklos Szeredi <mszeredi@suse.cz> |
splice: fix kmaps in default_file_splice_write() Unfortunately multiple kmap() within a single thread are deadlockable, so writing out multiple buffers with writev() isn't possible. Change the implementation so that it does a separate write() for each buffer. This actually simplifies the code a lot since the splice_from_pipe() helper can be used. This limitation is caused by HIGHMEM pages, and so only affects a subset of architectures and configurations. In the future it may be worth to implement default_file_splice_write() in a more efficient way on configs that allow it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
77f6bf57 |
|
14-May-2009 |
Andrew Morton <akpm@linux-foundation.org> |
splice: fix error return code fs/splice.c: In function 'default_file_splice_read': fs/splice.c:566: warning: 'error' may be used uninitialized in this function which is sort-of true. The code will in fact return -ENOMEM instead of the kernel_readv() return value. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
4f231228 |
|
13-May-2009 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix repeated kmap()'s in default_file_splice_read() We cannot reliably map more than one page at the time, or we risk deadlocking. Just allocate the pages from low mem instead. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
0b0a47f5 |
|
07-May-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: implement default splice_write method If f_op->splice_write() is not implemented, fall back to a plain write. Use vfs_writev() to write from the pipe buffers. This will allow splice on all filesystems and file types. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
6818173b |
|
07-May-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: implement default splice_read method If f_op->splice_read() is not implemented, fall back to a plain read. Use vfs_readv() to read into previously allocated pages. This will allow splice and functions using splice, such as the loop device, to work on all filesystems. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
7c77f0b3 |
|
07-May-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: implement pipe to pipe splicing Allow splice(2) to work when both the input and the output is a pipe. Based on the impementation of the tee(2) syscall, but instead of duplicating the buffer references move the buffers from the input pipe to the output pipe. Moving the whole buffer only succeeds if the full length of the buffer is spliced. Otherwise duplicate the buffer, just like tee(2), set the length of the output buffer and advance the offset on the input buffer. Since splice is operating on two pipes, special care needs to be taken with locking to prevent AN ABBA deadlock. Again this is done similarly to the tee(2) syscall, first preparing the input and output pipes so there's data to consume and space for that data, and then doing the move operation while holding both locks. If other processes are doing I/O on the same pipes parallel to the splice, then by the time both inodes are locked there might be no buffers left to move, or no space to move them to. In this case retry the whole operation, including the preparation phase. This could lead to starvation, but I'm not sure if that's serious enough to worry about. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
b80901bb |
|
16-Apr-2009 |
Randy Dunlap <randy.dunlap@oracle.com> |
splice: fix new kernel-doc warnings splice: fix kernel-doc warnings Warning(fs/splice.c:617): bad line: Warning(fs/splice.c:722): No description found for parameter 'sd' Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
61e0d47c |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: add helpers for locking pipe inode There are lots of sequences like this, especially in splice code: if (pipe->inode) mutex_lock(&pipe->inode->i_mutex); /* do something */ if (pipe->inode) mutex_unlock(&pipe->inode->i_mutex); so introduce helpers which do the conditional locking and unlocking. Also replace the inode_double_lock() call with a pipe_double_lock() helper to avoid spreading the use of this functionality beyond the pipe code. This patch is just a cleanup, and should cause no behavioral changes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
f8cc774c |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: remove generic_file_splice_write_nolock() Remove the now unused generic_file_splice_write_nolock() function. It's conceptually broken anyway, because splice may need to wait for pipe events so holding locks across the whole operation is wrong. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
328eaaba |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
ocfs2: fix i_mutex locking in ocfs2_splice_to_file() Rearrange locking of i_mutex on destination and call to ocfs2_rw_lock() so locks are only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
eb443e5a |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: fix i_mutex locking in generic_splice_write() Rearrange locking of i_mutex on destination so it's only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
2933970b |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: remove i_mutex locking in splice_from_pipe() splice_from_pipe() is only called from two places: - generic_splice_sendpage() - splice_write_null() Neither of these require i_mutex to be taken on the destination inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
b3c2d2dd |
|
14-Apr-2009 |
Miklos Szeredi <miklos@szeredi.hu> |
splice: split up __splice_from_pipe() Split up __splice_from_pipe() into four helper functions: splice_from_pipe_begin() splice_from_pipe_next() splice_from_pipe_feed() splice_from_pipe_end() splice_from_pipe_next() will wait (if necessary) for more buffers to be added to the pipe. splice_from_pipe_feed() will feed the buffers to the supplied actor and return when there's no more data available (or if all of the requested data has been copied). This is necessary so that implementations can do locking around the non-waiting splice_from_pipe_feed(). This patch should not cause any change in behavior. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
7bfac9ec |
|
06-Apr-2009 |
Miklos Szeredi <mszeredi@suse.cz> |
splice: fix deadlock in splicing to file There's a possible deadlock in generic_file_splice_write(), splice_from_pipe() and ocfs2_file_splice_write(): - task A calls generic_file_splice_write() - this calls inode_double_lock(), which locks i_mutex on both pipe->inode and target inode - ordering depends on inode pointers, can happen that pipe->inode is locked first - __splice_from_pipe() needs more data, calls pipe_wait() - this releases lock on pipe->inode, goes to interruptible sleep - task B calls generic_file_splice_write(), similarly to the first - this locks pipe->inode, then tries to lock inode, but that is already held by task A - task A is interrupted, it tries to lock pipe->inode, but fails, as it is already held by task B - ABBA deadlock Fix this by explicitly ordering locks: the outer lock must be on target inode and the inner lock (which is later unlocked and relocked) must be on pipe->inode. This is OK, pipe inodes and target inodes form two nonoverlapping sets, generic_file_splice_write() and friends are not called with a target which is a pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
266cf658 |
|
03-Apr-2009 |
David Howells <dhowells@redhat.com> |
FS-Cache: Recruit a page flags for cache management Recruit a page flag to aid in cache management. The following extra flag is defined: (1) PG_fscache (PG_private_2) The marked page is backed by a local cache and is pinning resources in the cache driver. If PG_fscache is set, then things that checked for PG_private will now also check for that. This includes things like truncation and page invalidation. The function page_has_private() had been added to make the checks for both PG_private and PG_private_2 at the same time. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
|
#
836f92ad |
|
14-Jan-2009 |
Heiko Carstens <hca@linux.ibm.com> |
[CVE-2009-0029] System call wrappers part 31 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
#
08e552c6 |
|
07-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: synchronized LRU A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's its own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup is one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
4e02ed4b |
|
29-Oct-2008 |
Nick Piggin <npiggin@suse.de> |
fs: remove prepare_write/commit_write Nothing uses prepare_write or commit_write. Remove them from the tree completely. [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
efc968d4 |
|
09-Oct-2008 |
Linus Torvalds <torvalds@linux-foundation.org> |
Don't allow splice() to files opened with O_APPEND This is debatable, but while we're debating it, let's disallow the combination of splice and an O_APPEND destination. It's not entirely clear what the semantics of O_APPEND should be, and POSIX apparently expects pwrite() to ignore O_APPEND, for example. So we could make up any semantics we want, including the old ones. But Miklos convinced me that we should at least give it some thought, and that accepting writes at arbitrary offsets is wrong at least for IS_APPEND() files (which always have O_APPEND set, even if the reverse isn't true: you can obviously have O_APPEND set on a regular file). So disallow O_APPEND entirely for now. I doubt anybody cares, and this way we have one less gray area to worry about. Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Jens Axboe <ens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
529ae9aa |
|
01-Aug-2008 |
Nick Piggin <npiggin@suse.de> |
mm: rename page trylock Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2f1936b8 |
|
24-Jun-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
[patch 3/5] vfs: change remove_suid() to file_remove_suid() All calls to remove_suid() are made with a file pointer, because (similarly to file_update_time) it is called when the file is written. Clean up callers by passing in a file instead of a dentry. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
#
bc40d73c |
|
25-Jul-2008 |
Nick Piggin <npiggin@suse.de> |
splice: use get_user_pages_fast Use get_user_pages_fast in splice. This reverts some mmap_sem batching there, however the biggest problem with mmap_sem tends to be hold times blocking out other threads rather than cacheline bouncing. Further: on architectures that implement get_user_pages_fast without locks, mmap_sem can be avoided completely anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
32502b84 |
|
04-Jul-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
splice: fix generic_file_splice_read() race with page invalidation If a page was invalidated during splicing from file to a pipe, then generic_file_splice_read() could return a short or zero count. This manifested itself in rare I/O errors seen on nfs exported fuse filesystems. This is because nfsd uses splice_direct_to_actor() to read files, and fuse uses invalidate_inode_pages2() to invalidate stale data on open. Fix by redoing the page find/create if it was found to be truncated (invalidated). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
ca39d651 |
|
20-May-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: handle try_to_release_page() failure splice currently assumes that try_to_release_page() always suceeds, but it can return failure. If it does, we cannot steal the page. Acked-by: Mingming Cao <cmm@us.ibm.com Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
a82c53a0 |
|
09-May-2008 |
Tom Zanussi <zanussi@comcast.net> |
splice: fix sendfile() issue with relay Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
75065ff6 |
|
08-May-2008 |
Jens Axboe <jens.axboe@oracle.com> |
Revert "relay: fix splice problem" This reverts commit c3270e577c18b3d0e984c3371493205a4807db9d.
|
#
7f3d4ee1 |
|
07-May-2008 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: splice remove_suid() cleanup generic_file_splice_write() duplicates remove_suid() just because it doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe() anyway, so this is rather pointless. Move locking to generic_file_splice_write() and call remove_suid() and __splice_from_pipe() instead. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
c3270e57 |
|
23-Apr-2008 |
Tom Zanussi <zanussi@comcast.ne> |
relay: fix splice problem Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
8191ecd1 |
|
10-Apr-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix infinite loop in generic_file_splice_read() There's a quirky loop in generic_file_splice_read() that could go on indefinitely, if the file splice returns 0 permanently (and not just as a temporary condition). Get rid of the loop and pass back -EAGAIN correctly from __generic_file_splice_read(), so we handle that condition properly as well. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
4cd13504 |
|
03-Apr-2008 |
Hugh Dickins <hugh@veritas.com> |
splice: use mapping_gfp_mask The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its mapping_gfp_mask, to avoid hangs under memory pressure. But nowadays it uses splice, usually going through __generic_file_splice_read. That must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02cf01ae |
|
20-Feb-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: only return -EAGAIN if there's hope of more data sys_tee() currently is a bit eager in returning -EAGAIN, it may do so even if we don't have a chance of anymore data becoming available. So improve the logic and only return -EAGAIN if we have an attached writer to the input pipe. Reported by Johann Felix Soden <johfel@gmx.de> and Patrick McManus <mcmanus@ducksong.com>. Tested-by: Johann Felix Soden <johfel@users.sourceforge.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
712a30e6 |
|
10-Feb-2008 |
Bastian Blank <bastian@waldi.eu.org> |
splice: fix user pointer access in get_iovec_page_array() Commit 8811930dc74a503415b35c4a79d14fb0b408a361 ("splice: missing user pointer access verification") added the proper access_ok() calls to copy_from_user_mmap_sem() which ensures we can copy the struct iovecs from userspace to the kernel. But we also must check whether we can access the actual memory region pointed to by the struct iovec to fix the access checks properly. Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Oliver Pinter <oliver.pntr@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8811930d |
|
08-Feb-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: missing user pointer access verification vmsplice_to_user() must always check the user pointer and length with access_ok() before copying. Likewise, for the slow path of copy_from_user_mmap_sem() we need to check that we may read from the user region. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Cc: Wojciech Purczynski <cliph@research.coseinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
80848708 |
|
29-Jan-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: always updated atime in direct splice Andre Majorel <aym-xunil@teaser.fr> points out that if we only updated the atime when we transfer some data, we deviate from the standard of always updating the atime. So change splice to always call file_accessed() even if splice_direct_to_actor() didn't transfer any data. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
9e97198d |
|
29-Jan-2008 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix problem with atime not being updated A bug report on nfsd that states that since it was switched to use splice instead of sendfile, the atime was no longer being updated on the input file. do_generic_mapping_read() does this when accessing the file, make splice do it for the direct splice handler. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
bbdfc2f7 |
|
07-Nov-2007 |
Jens Axboe <jens.axboe@oracle.com> |
[SPLICE]: Don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c43e259c |
|
12-Jan-2008 |
James Morris <jmorris@namei.org> |
security: call security_file_permission from rw_verify_area All instances of rw_verify_area() are followed by a call to security_file_permission(), so just call the latter from the former. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
|
#
b5376771 |
|
17-Oct-2007 |
Serge E. Hallyn <serue@us.ibm.com> |
Implement file posix capabilities Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fsccaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
afddba49 |
|
16-Oct-2007 |
Nick Piggin <npiggin@suse.de> |
fs: introduce write_begin, write_end, and perform_write aops These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). [mark.fasheh@oracle.com: API design contributions, code review and fixes] [akpm@linux-foundation.org: various fixes] [dmonakhov@sw.ru: new aop block_write_begin fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f4e6b498 |
|
16-Oct-2007 |
Fengguang Wu <wfg@mail.ustc.edu.cn> |
readahead: combine file_ra_state.prev_index/prev_offset into prev_pos Combine the file_ra_state members unsigned long prev_index unsigned int prev_offset into loff_t prev_pos It is more consistent and better supports huge files. Thanks to Peter for the nice proposal! [akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
6866bef4 |
|
16-Oct-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix double kunmap() in vmsplice copy path The out label should not include the unmap, the only way to jump there already has unmapped the source. 00002000 f7c21a00 00000000 00000000 c0489036 00018e32 00000002 00000000 00001000 Call Trace: [<c0487dd9>] pipe_to_user+0xca/0xd3 [<c0488233>] __splice_from_pipe+0x53/0x1bd [<c0454947>] ------------[ cut here ]------------ filemap_fault+0x221/0x380 [<c0487d0f>] pipe_to_user+0x0/0xd3 [<c0489036>] sys_vmsplice+0x3b7/0x422 [<c045ec3f>] kernel BUG at mm/highmem.c:206! handle_mm_fault+0x4d5/0x8eb [<c041ed5b>] kmap_atomic+0x1c/0x20 [<c045d33d>] unmap_vmas+0x3d1/0x584 [<c045f717>] free_pgtables+0x90/0xa0 [<c041d84b>] pgd_dtor+0x0/0x1 [<c044d665>] audit_syscall_exit+0x2aa/0x2c6 [<c0407817>] do_syscall_trace+0x124/0x169 [<c0404df2>] syscall_call+0x7/0xb ======================= Code: 2d 00 d0 5b 00 25 00 00 e0 ff 29 invalid opcode: 0000 [#1] c2 89 d0 c1 e8 0c 8b 14 85 a0 6c 7c c0 4a 85 d2 89 14 85 a0 6c 7c c0 74 07 31 c9 4a 75 15 eb 04 <0f> 0b eb fe 31 c9 81 3d 78 38 6d c0 78 38 6d c0 0f 95 c1 b0 01 EIP: [<c045bbc3>] kunmap_high+0x51/0x8e SS:ESP 0068:f5960df0 SMP Modules linked in: netconsole autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cmib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video output sbs batteryac parport_pc lp parport sg i2c_piix4 i2c_core floppy cfi_probe gen_probe scb2_flash mtd chipreg tg3 e1000 button ide_cd serio_raw cdrom aic7xxx scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd CPU: 3 EIP: 0060:[<c045bbc3>] Not tainted VLI EFLAGS: 00010246 (2.6.23 #1) EIP is at kunmap_high+0x51/0x8e Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
75723957 |
|
01-Oct-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Fix possible splice() mmap_sem deadlock Nick Piggin points out that splice isn't being good about the mmap semaphore: while two readers can nest inside each others, it does leave a possible deadlock if a writer (ie a new mmap()) comes in during that nesting. Original "just move the locking" patch by Nick, replaced by one by me based on an optimistic pagefault_disable(). And then Jens tested and updated that patch. Reported-by: Nick Piggin <npiggin@suse.de> Tested-by: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
79685b8d |
|
27-Jul-2007 |
Randy Dunlap <randy.dunlap@oracle.com> |
docbook: add pipes, other fixes Fix some typos in pipe.c and splice.c. Add pipes API to kernel-api.tmpl. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
6a860c97 |
|
20-Jul-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix bad unlock_page() in error case If add_to_page_cache_lru() fails, the page will not be locked. But splice jumps to an error path that does a page release and unlock, causing a BUG() in unlock_page(). Fix this by adding one more label that just releases the page. This bug was actually triggered on EL5 by gurudas pai <gurudas.pai@oracle.com> using fio. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cf914a7d |
|
19-Jul-2007 |
Rusty Russell <rusty@rustcorp.com.au> |
readahead: split ondemand readahead interface into two functions Split ondemand readahead interface into two functions. I think this makes it a little clearer for non-readahead experts (like Rusty). Internally they both call ondemand_readahead(), but the page argument is changed to an obvious boolean flag. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d8983910 |
|
19-Jul-2007 |
Fengguang Wu <wfg@mail.ustc.edu.cn> |
readahead: pass real splice size Pass real splice size to page_cache_readahead_ondemand(). The splice code works in chunks of 16 pages internally. The readahead code should be told of the overall splice size, instead of the internal chunk size. Otherwize bad things may happen. Imagine some 17-page random splice reads. The code before this patch will result in two readahead calls: readahead(16); readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O and 31 readahead miss pages. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
431a4820 |
|
19-Jul-2007 |
Fengguang Wu <wfg@mail.ustc.edu.cn> |
readahead: move synchronous readahead call out of splice loop Move synchronous page_cache_readahead_ondemand() call out of splice loop. This avoids one pointless page allocation/insertion in case of non-zero ra_pages, or many pointless readahead calls in case of zero ra_pages. Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will not get expected readahead behavior anyway. The splice code works in batches of 16 pages, which can be taken as another form of synchronous readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a08a166f |
|
19-Jul-2007 |
Fengguang Wu <wfg@mail.ustc.edu.cn> |
readahead: convert splice invocations Convert splice reads to use on-demand readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Jens Axboe <axboe@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bcd4f3ac |
|
16-Jul-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: direct splicing updates ppos twice OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> reported that he's noticed nfsd read corruption in recent kernels, and did the hard work of discovering that it's due to splice updating the file position twice. This means that the next operation would start further ahead than it should. nfsd_vfs_read() splice_direct_to_actor() while(len) { do_splice_to() [update sd->pos] -> generic_file_splice_read() [read from sd->pos] nfsd_direct_splice_actor() -> __splice_from_pipe() [update sd->pos] There's nothing wrong with the core splice code, but the direct splicing is an addon that calls both input and output paths. So it has to take care in locally caching offset so it remains correct. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
51a92c0f |
|
13-Jul-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix offset mangling with direct splicing (sendfile) If the output actor doesn't transfer the full amount of data, we will increment ppos too much. Two related bugs in there: - We need to break out and return actor() retval if it is shorted than what we spliced into the pipe. - Adjust ppos only according to actor() return. Also fix loop problem in generic_file_splice_read(), it should not keep going when data has already been transferred. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
29ce2058 |
|
13-Jul-2007 |
James Morris <jmorris@namei.org> |
security: revalidate rw permissions for sys_splice and sys_vmsplice Revalidate read/write permissions for splice(2) and vmslice(2), in case security policy has changed since the files were opened. Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
0845718d |
|
12-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
pipe: add documentation and comments As per Andrew Mortons request, here's a set of documentation for the generic pipe_buf_operations hooks, the pipe, and pipe_buffer structures. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
cac36bb0 |
|
14-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
pipe: change the ->pin() operation to ->confirm() The name 'pin' was badly chosen, it doesn't pin a pipe buffer in the most commonly used sense in the kernel. So change the name to 'confirm', after debating this issue with Hugh Dickins a bit. A good return from ->confirm() means that the buffer is really there, and that the contents are good. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
932cc6d4 |
|
21-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: completely document external interface with kerneldoc Also add fs/splice.c as a kerneldoc target with a smaller blurb that should be expanded to better explain the overview of splice. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
497f9625 |
|
10-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
pipe: allow passing around of ops private pointer relay needs this for proper consumption handling, and the network receive support needs it as well to lookup the sk_buff on pipe release. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
d6b29d7c |
|
04-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: divorce the splice structure/function definitions from the pipe header We need to move even more stuff into the header so that folks can use the splice_to_pipe() implementation instead of open-coding a lot of pipe knowledge (see relay implementation), so move to our own header file finally. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
6a14b90b |
|
14-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
vmsplice: add vmsplice-to-user support A bit of a cheat, it actually just copies the data to userspace. But this makes the interface nice and symmetric and enables people to build on splice, with room for future improvement in performance. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
c66ab6fa |
|
12-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: abstract out actor data For direct splicing (or private splicing), the output may not be a file. So abstract out the handling into a specified actor function and put the data in the splice_desc structure earlier, so we can build on top of that. This is the first step in better splice handling for drivers, and also for implementing vmsplice _to_ user memory. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
02676e5a |
|
15-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: only check do_wakeup in splice_to_pipe() for a real pipe We only ever set do_wakeup to non-zero if the pipe has an inode backing, so it's pointless to check outside the pipe->inode check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
00de00bd |
|
15-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: fix leak of pages on short splice to pipe If the destination pipe is full and we already transferred data, we break out instead of waiting for more pipe room. The exit logic looks at spd->nr_pages to see if we moved everything inside the spd container, but we decrement that variable in the loop to decide when spd has emptied. Instead we want to compare to the original page count in the spd, so cache that in a local variable. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
17ee4f49 |
|
15-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: adjust balance_dirty_pages_ratelimited() call As we have potentially dirtied more than 1 page, we should indicate as such to the dirty page balancing. So call balance_dirty_pages_ratelimited_nr() and pass in the approximate number of pages we dirtied. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
620a324b |
|
07-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: __generic_file_splice_read: fix read/truncate race Original patch and description from Neil Brown <neilb@suse.de>, merged and adapted to splice branch by me. Neils text follows: __generic_file_splice_read() currently samples the i_size at the start and doesn't do so again unless it needs to call ->readpage to load a page. After ->readpage it has to re-sample i_size as a truncate may have caused that page to be filled with zeros, and the read() call should not see these. However there are other activities that might cause ->readpage to be called on a page between the time that __generic_file_splice_read() samples i_size and when it finds that it has an uptodate page. These include at least read-ahead and possibly another thread performing a read So we must sample i_size *after* it has an uptodate page. Thus the current sampling at the start and after a read can be replaced with a sampling before page addition into spd. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
475ecade |
|
07-Jun-2007 |
Hugh Dickins <hugh@veritas.com> |
splice: __generic_file_splice_read: fix i_size_read() length checks __generic_file_splice_read's partial page check, at eof after readpage, not only got its calculations wrong, but also reused the loff variable: causing data corruption when splicing from a non-0 offset in the file's last page (revealed by ext2 -b 1024 testing on a loop of a tmpfs file). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
20d698db |
|
05-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: move balance_dirty_pages_ratelimited() outside of splice actor I've seen inode related deadlocks, so move this call outside of the actor itself, which may hold the inode lock. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
267adc3e |
|
08-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: remove do_splice_direct() symbol export It's only supposed to be used by do_sendfile(), which is never modular. So kill the export. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
d366d398 |
|
01-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
splice: move inode size check into generic_file_splice_read() Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
86aa5ac5 |
|
08-May-2007 |
Jens Axboe <jens.axboe@oracle.com> |
[PATCH] splice: always call into page_cache_readahead() Don't try to guess what the read-ahead logic will do, allow it to make its own decisions. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
9ae9d68c |
|
08-May-2007 |
Fengguang Wu <fengguang.wu@gmail.com> |
[PATCH] splice(): fix interaction with readahead Eric Dumazet, thank you for disclosing this bug. Readahead logic somehow fails to populate the page range with data. It can be because 1) the readahead routine is not always called in the following lines of fs/splice.c: if (!loff || nr_pages > 1) page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages); 2) even called, page_cache_readahead() wont guarantee the pages are there. It wont submit readahead I/O for pages already in the radix tree, or when (ra_pages == 0), or after 256 cache hits. In your case, it should be because of the retried reads, which lead to excessive cache hits, and disables readahead at some time. And that _one_ failure of readahead blocks the whole read process. The application receives EAGAIN and retries the read, but __generic_file_splice_read() refuse to make progress: - in the previous invocation, it has allocated a blank page and inserted it into the radix tree, but never has the chance to start I/O for it: the test of SPLICE_F_NONBLOCK goes before that. - in the retried invocation, the readahead code will neither get out of the cache hit mode, nor will it submit I/O for an already existing page. Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
d9993c37 |
|
29-Mar-2007 |
Dmitriy Monakhov <dmonakhov@sw.ru> |
[PATCH] splice: partial write fix Currently if partial write has happened while ->commit_write() then page wasn't marked as accessed and rebalanced. Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
40bee44e |
|
21-Mar-2007 |
Mark Fasheh <mark.fasheh@oracle.com> |
Export __splice_from_pipe() Ocfs2 wants to implement it's own splice write actor so that it can better manage cluster / page locks. This lets us re-use the rest of splice write while only providing our own code where it's actually important. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
08c72591 |
|
27-Mar-2007 |
Nick Piggin <npiggin@suse.de> |
2/2 splice: dont readpage Splice does not need to readpage to bring the page uptodate before writing to it, because prepare_write will take care of that for us. Splice is also wrong to SetPageUptodate before the page is actually uptodate. This results in the old uninitialised memory leak. This gets fixed as a matter of course when removing the readpage logic. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
485ddb4b |
|
27-Mar-2007 |
Nick Piggin <npiggin@suse.de> |
1/2 splice: dont steal Stealing pages with splice is problematic because we cannot just insert an uptodate page into the pagecache and hope the filesystem can take care of it later. We also cannot just ClearPageUptodate, then hope prepare_write does not write anything into the page, because I don't think prepare_write gives that guarantee. Remove support for SPLICE_F_MOVE for now. If we really want to bring it back, we might be able to do so with a the new filesystem buffered write aops APIs I'm working on. If we really don't want to bring it back, then we should decide that sooner rather than later, and remove the flag and all the stealing infrastructure before anybody starts using it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
d4c3cca9 |
|
13-Dec-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[PATCH] constify pipe_buf_operations - pipe/splice should use const pipe_buf_operations and file_operations - struct pipe_inode_info has an unused field "start" : get rid of it. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
0f7fc9e4 |
|
08-Dec-2006 |
Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> |
[PATCH] VFS: change struct file to use struct path This patch changes struct file to use struct path instead of having independent pointers to struct dentry and struct vfsmount, and converts all users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}. Additionally, it adds two #define's to make the transition easier for users of the f_dentry and f_vfsmnt. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
ddac0d39 |
|
03-Nov-2006 |
Jens Axboe <jens.axboe@oracle.com> |
[PATCH] splice: fix problem introduced with inode diet After the inode slimming patch that unionised i_pipe/i_bdev/i_cdev, it's no longer enough to check for existance of ->i_pipe to verify that this is a pipe. Original patch from Eric Dumazet <dada1@cosmosbay.com> Final solution suggested by Linus. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
2ae88149 |
|
28-Oct-2006 |
Nick Piggin <npiggin@suse.de> |
[PATCH] mm: clean up pagecache allocation - Consolidate page_cache_alloc - Fix splice: only the pagecache pages and filesystem data need to use mapping_gfp_mask. - Fix grab_cache_page_nowait: same as splice, also honour NUMA placement. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
8c34e2d6 |
|
17-Oct-2006 |
Jens Axboe <jens.axboe@oracle.com> |
[PATCH] Remove SUID when splicing into an inode Originally from Mark Fasheh <mark.fasheh@oracle.com> generic_file_splice_write() does not remove S_ISUID or S_ISGID. This is inconsistent with the way we generally write to files. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
6da61809 |
|
17-Oct-2006 |
Mark Fasheh <mark.fasheh@oracle.com> |
[PATCH] Introduce generic_file_splice_write_nolock() This allows file systems to manage their own i_mutex locking while still re-using the generic_file_splice_write() logic. OCFS2 in particular wants this so that it can order cluster locks within i_mutex. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
62752ee1 |
|
17-Oct-2006 |
Mark Fasheh <mark.fasheh@oracle.com> |
[PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
e6e80f29 |
|
11-Oct-2006 |
Jens Axboe <jens.axboe@oracle.com> |
[PATCH] splice: fix pipe_to_file() ->prepare_write() error path Don't jump to the unlock+release path, we already did that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
0fe23479 |
|
04-Sep-2006 |
Jens Axboe <axboe@kernel.dk> |
[PATCH] Update axboe@suse.de email address As people often look for the copyright in files to see who to mail, update the link to a neutral one. Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
aadd06e5 |
|
10-Jul-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix problems with sys_tee() Several issues noticed/fixed: - We cannot reliably block in link_pipe() while holding both input and output mutexes. So do preparatory checks before locking down both mutexes and doing the link. - The ipipe->nrbufs vs i check was bad, because we could have dropped the ipipe lock in-between. This causes us to potentially look at unknown buffers if we were racing with someone else reading this pipe. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
9e94cd4f |
|
20-Jun-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: retrieve mapping after locking the page Otherwise we could be racing with truncate/mapping removal. Problem found/fixed by Nick Piggin <npiggin@suse.de>, logic rewritten by me. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
a0548871 |
|
03-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: redo page lookup if add_to_page_cache() returns -EEXIST This can happen quite easily, if several processes are trying to splice the same file at the same time. It's not a failure, it just means someone raced with us in allocating this file page. So just dump the allocated page and relookup the original. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
76ad4d11 |
|
03-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: rename remaining info variables to pipe Same thing was done in fs/pipe.c and most of fs/splice.c, but we had a few missing still. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
1432873a |
|
03-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: LRU fixups Nick says that the current construct isn't safe. This goes back to the original, but sets PIPE_BUF_FLAG_LRU on user pages as well as they all seem to be on the LRU in the first place. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
bfc4ee39 |
|
03-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix unlocking of page on error ->prepare_write() Looking at generic_file_buffered_write(), we need to unlock_page() if prepare write fails and it isn't due to racing with truncate(). Also trim the size if ->prepare_write() fails, if we have to. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
330ab716 |
|
02-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] vmsplice: restrict stealing a little more Apply the same rules as the anon pipe pages, only allow stealing if no one else is using the page. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
a893b99b |
|
02-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix page LRU accounting Currently we rely on the PIPE_BUF_FLAG_LRU flag being set correctly to know whether we need to fiddle with page LRU state after stealing it, however for some origins we just don't know if the page is on the LRU list or not. So remove PIPE_BUF_FLAG_LRU and do this check/add manually in pipe_to_file() instead. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
7591489a |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] vmsplice: fix badly placed end paranthesis We need to use the minium of {len, PAGE_SIZE-off}, not {len, PAGE_SIZE}-off. The latter doesn't make any sense, and could cause us to attempt negative length transfers... Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
7afa6fd0 |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] vmsplice: allow user to pass in gift pages If SPLICE_F_GIFT is set, the user is basically giving this pages away to the kernel. That means we can steal them for eg page cache uses instead of copying it. The data must be properly page aligned and also a multiple of the page size in length. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
f6762b7a |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] pipe: enable atomic copying of pipe data to/from user space The pipe ->map() method uses kmap() to virtually map the pages, which is both slow and has known scalability issues on SMP. This patch enables atomic copying of pipe pages, by pre-faulting data and using kmap_atomic() instead. lmbench bw_pipe and lat_pipe measurements agree this is a Good Thing. Here are results from that on a UP machine with highmem (1.5GiB of RAM), running first a UP kernel, SMP kernel, and SMP kernel patched. Vanilla-UP: Pipe bandwidth: 1622.28 MB/sec Pipe bandwidth: 1610.59 MB/sec Pipe bandwidth: 1608.30 MB/sec Pipe latency: 7.3275 microseconds Pipe latency: 7.2995 microseconds Pipe latency: 7.3097 microseconds Vanilla-SMP: Pipe bandwidth: 1382.19 MB/sec Pipe bandwidth: 1317.27 MB/sec Pipe bandwidth: 1355.61 MB/sec Pipe latency: 9.6402 microseconds Pipe latency: 9.6696 microseconds Pipe latency: 9.6153 microseconds Patched-SMP: Pipe bandwidth: 1578.70 MB/sec Pipe bandwidth: 1579.95 MB/sec Pipe bandwidth: 1578.63 MB/sec Pipe latency: 9.1654 microseconds Pipe latency: 9.2266 microseconds Pipe latency: 9.1527 microseconds Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
e27dedd8 |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: call handle_ra_miss() on failure to lookup page Notify the readahead logic of the missing page. Suggested by Oleg Nesterov. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
f84d7519 |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] pipe: introduce ->pin() buffer operation The ->map() function is really expensive on highmem machines right now, since it has to use the slower kmap() instead of kmap_atomic(). Splice rarely needs to access the virtual address of a page, so it's a waste of time doing it. Introduce ->pin() to take over the responsibility of making sure the page data is valid. ->map() is then reduced to just kmap(). That way we can also share a most of the pipe buffer ops between pipe.c and splice.c Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
0568b409 |
|
01-May-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix bugs in pipe_to_file() Found by Oleg Nesterov <oleg@tv-sign.ru>, fixed by me. - Only allow full pages to go to the page cache. - Check page != buf->page instead of using PIPE_BUF_FLAG_STOLEN. - Remember to clear 'stolen' if add_to_page_cache() fails. And as a cleanup on that: - Make the bottom fall-through logic a little less convoluted. Also make the steal path hold an extra reference to the page, so we don't have to differentiate between stolen and non-stolen at the end. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
46e678c9 |
|
30-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix bugs with stealing regular pipe pages - Check that page has suitable count for stealing in the regular pipes. - pipe_to_file() assumes that the page is locked on succesful steal, so do that in the pipe steal hook - Missing unlock_page() in add_to_page_cache() failure. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
eb20796b |
|
27-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: make the read-side do batched page lookups Use the new find_get_pages_contig() to potentially look up the entire splice range in one single call. This speeds up generic_file_splice_read() quite a bit. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
eb645a24 |
|
27-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: switch to using page_cache_readahead() Avoids doing useless work, when the file is fully cached. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
00522fb4 |
|
26-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: rearrange moving to/from pipe helpers We need these for people writing their own ->splice_read/write hooks. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
912d35f8 |
|
26-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] Add support for the sys_vmsplice syscall sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice() moves data to a pipe, with the input being a user address range instead. This uses an approach suggested by Linus, where we can hold partial ranges inside the pages[] map. Hopefully this will be useful for network receive support as well. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
016b661e |
|
25-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix offset problems Make the move_from_pipe() actors return number of bytes processed, then move_from_pipe() can decide more cleverly when to move on to the next buffer. This fixes problems with pipe offset and differing file offset. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
ba5f5d90 |
|
25-Apr-2006 |
Andrew Morton <akpm@osdl.org> |
[PATCH] splice: fix min() warning Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
82aa5d61 |
|
20-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix smaller sized splice reads Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
9e0267c2 |
|
19-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fixup writeout path after ->map changes Since ->map() no longer locks the page, we need to adjust the handling of those pages (and stealing) a little. This now passes full regressions again. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
a4514ebd |
|
19-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: offset fixes - We need to adjust *ppos for writes as well. - Copy back modified offset value if one was passed in, similar to what sendfile does. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
2a27250e |
|
19-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] tee: link_pipe() must be careful when dropping one of the pipe locks We need to ensure that we only drop a lock that is ordered last, to avoid ABBA deadlocks with competing processes. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
c4f895cb |
|
19-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: cleanup the SPLICE_F_NONBLOCK handling - generic_file_splice_read() more readable and correct - Don't bail on page allocation with NONBLOCK set, just don't allow direct blocking on IO (eg lock_page). Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
91ad66ef |
|
19-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: close i_size truncate races on read We need to check i_size after doing a blocking readpage. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
70524490 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add support for sys_tee() Basically an in-kernel implementation of tee, which uses splice and the pipe buffers as an intelligent way to pass data around by reference. Where the user space tee consumes the input and produces a stdout and file output, this syscall merely duplicates the data inside a pipe to another pipe. No data is copied, the output just grabs a reference to the input pipe data. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
cbb7e577 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: pass offset around for ->splice_read() and ->splice_write() We need not use ->f_pos as the offset for the file input/output. If the user passed an offset pointer in through sys_splice(), just use that and leave ->f_pos alone. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
73d62d83 |
|
11-Apr-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] splice: comment styles - capitalize consistently - end sentences in one way or another - update comment text to match the implementation Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
c2058e06 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add Ingo as addition copyright holder The comment is also somewhat out of date, correct that as well. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
49570e9b |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: unlikely() optimizations Also corrects a few comments. Patch mainly from Ingo, changes by me. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
6f767b04 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: speedups and optimizations - Kill the local variables that cache ->nrbufs, they just take up space. - Only set do_wakeup for a real pipe. This is a big win for direct splicing. - Kill i_mutex lock around ->f_pos update, regular io paths don't do this either. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
7480a904 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: speedup __generic_file_splice_read Using find_get_page() is a lot faster than find_or_create_page(). This gets splice a lot closer to sendfile() for fd -> socket transfers. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
b92ce558 |
|
11-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add direct fd <-> fd splicing support It's more efficient for sendfile() emulation. Basically we cache an internal private pipe and just use that as the intermediate area for pages. Direct splicing is not available from sys_splice(), it is only meant to be used for sendfile() emulation. Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at exit for the normal fast path. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
529565dc |
|
10-Apr-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] splice: add optional input and output offsets add optional input and output offsets to sys_splice(), for seekable file descriptors: asmlinkage long sys_splice(int fd_in, loff_t __user *off_in, int fd_out, loff_t __user *off_out, size_t len, unsigned int flags); semantics are straightforward: f_pos will be updated with the offset provided by user-space, before the splice transfer is about to begin. Providing a NULL offset pointer means the existing f_pos will be used (and updated in situ). Providing an offset for a pipe results in -ESPIPE. Providing an invalid offset pointer results in -EFAULT. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
3a326a2c |
|
10-Apr-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] introduce a "kernel-internal pipe object" abstraction separate out the 'internal pipe object' abstraction, and make it usable to splice. This cleans up and fixes several aspects of the internal splice APIs and the pipe code: - pipes: the allocation and freeing of pipe_inode_info is now more symmetric and more streamlined with existing kernel practices. - splice: small micro-optimization: less pointer dereferencing in splice methods Signed-off-by: Ingo Molnar <mingo@elte.hu> Update XFS for the ->splice_read/->splice_write changes. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
0b749ce3 |
|
10-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: be smarter about calling do_page_cache_readahead() We don't want to call into the read-ahead logic unless we are at the start of a page, _or_ we have multiple pages to read. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
49d0b21b |
|
10-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: optimize the splice buffer mapping We don't really need to lock down the pages, just make sure they are uptodate. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
16c523dd |
|
10-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: cleanup __generic_file_splice_read() The whole shadow/pages logic got overly complex, and this simpler approach is actually faster in testing. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
c0bd1f65 |
|
10-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: only call wake_up_interruptible() when we really have to __wake_up_common() is pretty heavy in the kernel profiles, this brings it down to a more acceptable level. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
9aefe431 |
|
10-Apr-2006 |
Dave Jones <davej@redhat.com> |
[PATCH] splice: potential !page dereference We can get to out: with a NULL page, which we probably don't want to be calling page_cache_release() on. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
c7f21e4f |
|
10-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: mark the io page as accessed We should do that, since we do the LRU manipulation ourselves now. Suggested by Nick Piggin. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
3e7ee3e7 |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix page stealing LRU handling. Originally from Nick Piggin, just adapted to the newer branch. You can't check PageLRU without holding zone->lru_lock. The page release code can get away with it only because the page refcount is 0 at that point. Also, you can't reliably remove pages from the LRU unless the refcount is 0. Ever. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
ad8d6f0a |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: page stealing needs to wait_on_page_writeback() Thanks to Andrew for the good explanation of why this is so. akpm writes: If a page is under writeback and we remove it from pagecache, it's still going to get written to disk. But the VFS no longer knows about that page, nor that this page is about to modify disk blocks. So there might be scenarios in which those blocks-which-are-about-to-be-written-to get reused for something else. When writeback completes, it'll scribble on those blocks. This won't happen in ext2/ext3-style filesystems in normal mode because the page has buffers and try_to_release_page() will fail. But ext2 in nobh mode doesn't attach buffers at all - it just sticks the page in a BIO, finds some new blocks, points the BIO at those blocks and lets it rip. While that write IO's in flight, someone could truncate the file. Truncate won't block on the writeout because the page isn't in pagecache any more. So truncate will the free the blocks from the file under the page's feet. Then something else can reallocate those blocks. Then write data to them. Now, the original write completes, corrupting the filesystem. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
059a8f37 |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: export generic_splice_sendpage Forgot that one, thanks Jeff. Also move the other EXPORT_SYMBOL to right below the functions. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
b2b39fa4 |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add a SPLICE_F_MORE flag This lets userspace indicate whether more data will be coming in a subsequent splice call. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
83f9135b |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add comments documenting more of the code Hopefully this will make Andrew a little more happy. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
4f6f0bd2 |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: improve writeback and clean up page stealing By cleaning up the writeback logic (killing write_one_page() and the manual set_page_dirty()), we can get rid of ->stolen inside the pipe_buffer and just keep it local in pipe_to_file(). This also adds dirty page balancing logic and O_SYNC handling. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
53cd9ae8 |
|
02-Apr-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: fix shadow[] filling logic Clear the entire range, and don't increment pidx or we keep filling the same position again and again. Thanks to KAMEZAWA Hiroyuki. Signed-off-by: Jens Axboe <axboe@suse.de>
|
#
29e35094 |
|
02-Apr-2006 |
Linus Torvalds <torvalds@g5.osdl.org> |
splice: add SPLICE_F_NONBLOCK flag It doesn't make the splice itself necessarily nonblocking (because the actual file descriptors that are spliced from/to may block unless they have the O_NONBLOCK flag set), but it makes the splice pipe operations nonblocking. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
a0f06780 |
|
30-Mar-2006 |
Jeff Garzik <jeff@garzik.org> |
[PATCH] splice exports Woe be unto he who builds their filesystems as modules. Signed-off-by: Jeff Garzik <jeff@garzik.org> [ Obscure quote from the infamous geek bible? ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
5abc97aa |
|
30-Mar-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] splice: add support for SPLICE_F_MOVE flag This enables the caller to migrate pages from one address space page cache to another. In buzz word marketing, you can do zero-copy file copies! Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
5274f052 |
|
30-Mar-2006 |
Jens Axboe <axboe@suse.de> |
[PATCH] Introduce sys_splice() system call This adds support for the sys_splice system call. Using a pipe as a transport, it can connect to files or sockets (latter as output only). From the splice.c comments: "splice": joining two ropes together by interweaving their strands. This is the "extended pipe" functionality, where a pipe is used as an arbitrary in-memory buffer. Think of a pipe as a small kernel buffer that you can use to transfer data from one end to the other. The traditional unix read/write is extended with a "splice()" operation that transfers data buffers to or from a pipe buffer. Named by Larry McVoy, original implementation from Linus, extended by Jens to support splicing to files and fixing the initial implementation bugs. Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|