History log of /linux-master/fs/sync.c
Revision Date Author Comments
# 59c10c52 05-Apr-2022 Guo Ren <guoren@kernel.org>

riscv: compat: syscall: Add compat_sys_call_table implementation

Implement compat sys_call_table and some system call functions:
truncate64, ftruncate64, fallocate, pread64, pwrite64,
sync_file_range, readahead, fadvise64_64 which need argument
translation.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/r/20220405071314.3225832-12-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>


# 5679897e 30-Jan-2022 Darrick J. Wong <djwong@kernel.org>

vfs: make sync_filesystem return errors from ->sync_fs

Strangely, sync_filesystem ignores the return code from the ->sync_fs
call, which means that syscalls like syncfs(2) never see the error.
This doesn't seem right, so fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>


# 1e03a36b 19-Oct-2021 Christoph Hellwig <hch@lst.de>

block: simplify the block device syncing code

Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 70164eb6 19-Oct-2021 Christoph Hellwig <hch@lst.de>

block: remove __sync_blockdev

Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9a208ba5 19-Oct-2021 Christoph Hellwig <hch@lst.de>

fs: remove __sync_filesystem

There is no clear benefit in having this helper vs just open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 735e4ae5 01-Jun-2020 Jeff Layton <jlayton@kernel.org>

vfs: track per-sb writeback errors and report them to syncfs

Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32b1924b2 09-Apr-2020 Konstantin Khlebnikov <koct9i@gmail.com>

ovl: skip overlayfs superblocks at global sync

Stacked filesystems like overlayfs has no own writeback, but they have to
forward syncfs() requests to backend for keeping data integrity.

During global sync() each overlayfs instance calls method ->sync_fs() for
backend although it itself is in global list of superblocks too. As a
result one syscall sync() could write one superblock several times and send
multiple disk barriers.

This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.

Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c553ea4f 13-May-2019 Amir Goldstein <amir73il@gmail.com>

fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the
reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}. Those users explicitly
requested to wait for in-flight IO as well as to writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
WB_SYNC_ALL writeback to perform the full range sync request.

Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com
Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 22f96b38 09-Apr-2019 Jens Axboe <axboe@kernel.dk>

fs: add sync_file_range() helper

This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 806cbae1 11-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall

Using this helper allows us to avoid the in-kernel calls to the
sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_sync_file_range().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 70f68ee8 14-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

fs: add ksys_sync() helper; remove in-kernel calls to sys_sync()

Using this helper allows us to avoid the in-kernel calls to the
sys_sync() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_sync().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 0d07e557 06-Mar-2018 Christoph Hellwig <hch@lst.de>

fs: don't clear I_DIRTY_TIME before calling mark_inode_dirty_sync

__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 9ba4b2df 20-Sep-2017 Jens Axboe <axboe@kernel.dk>

fs: kill 'nr_pages' argument from wakeup_flusher_threads()

Everybody is passing in 0 now, let's get rid of the argument.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# de23abd1 06-Sep-2017 Jeff Layton <jlayton@kernel.org>

fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range

fsync codepath assumes that f_mapping can never be NULL, but
sync_file_range has a check for that.

Remove the one from sync_file_range as I don't see how you'd ever get a
NULL pointer in here.

Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6454568d 24-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: convert sync_file_range to use errseq_t based error-tracking

sync_file_range doesn't call down into the filesystem directly at all.
It only kicks off writeback of pagecache pages and optionally waits
on the result.

Convert sync_file_range to use errseq_t based error tracking, under the
assumption that most users will prefer this behavior when errors occur.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>


# bc98a42c 17-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)

Firstly by applying the following with coccinelle's spatch:

@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)

@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)

to remove left over excess bracketage and finally by applying:

@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>


# 0f41074a 05-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: remove call_fsync helper function

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>


# 0eb8af49 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helper for calling f_op->fsync()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23d01270 06-Nov-2015 Jan Kara <jack@suse.com>

fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback

sync_file_range(2) is documented to issue writeback only for pages that
are not currently being written. After all the system call has been
created for userspace to be able to issue background writeout and so
waiting for in-flight IO is undesirable there. However commit
ee53a891f474 ("mm: do_sync_mapping_range integrity fix") switched
do_sync_mapping_range() and thus sync_file_range() to issue writeback in
WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
relying on WB_SYNC_ALL semantics.

These days do_sync_mapping_range() went away and we can switch
sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
help PostgreSQL avoid large latency spikes when flushing data in the
background.

Andres measured a 20% increase in transactions per second on an SSD disk.

Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Andres Freund <andres@anarazel.de>
Tested-By: Andres Freund <andres@anarazel.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aa750fd7 05-Nov-2015 Junichi Nomura <j-nomura@ce.jp.nec.com>

mm/filemap.c: make global sync not clear error status of individual inodes

filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

Application System admin
----------------------------------------------------------
write data on page cache
Run sync command
writeback completes with error
filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0ae45f63 01-Feb-2015 Theodore Ts'o <tytso@mit.edu>

vfs: add support for a lazytime mount option

Add a new mount option which enables a new "lazytime" mode. This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode. The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table. The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives. Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b583043e 30-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

kill f_dentry uses

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 10096fb1 21-Aug-2014 Anton Altaparmakov <aia21@cam.ac.uk>

Export sync_filesystem() for modular ->remount_fs() use

This patch changes sync_filesystem() to be EXPORT_SYMBOL().

The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d6408 ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.

As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0dc83bd3 21-Feb-2014 Jan Kara <jack@suse.cz>

Revert "writeback: do not sync data dirtied after sync start"

This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>


# d311d79d 09-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

fix O_SYNC|O_APPEND syncing the wrong range on write()

It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
pos_before_write .. pos_before_write + written - 1
instead. Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().

All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().

The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c4a391b5 12-Nov-2013 Jan Kara <jack@suse.cz>

writeback: do not sync data dirtied after sync start

When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
(e.g. causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.

The following script reproduces the problem:

function run_writers
{
for (( i = 0; i < 10; i++ )); do
mkdir $1/dir$i
for (( j = 0; j < 40000; j++ )); do
dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
done &
done
}

for dir in "$@"; do
run_writers $dir
done

sleep 40
time sync

Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.

To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426. With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72c2d531 22-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

file->f_op is never NULL...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4a0fd5bf 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long

... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>,
killing the boilerplate crap around them.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2903ff01 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch simple cases of fget_light to fdget

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4ea425b6 03-Jul-2012 Jan Kara <jack@suse.cz>

vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes

wakeup_flusher_threads(0) will queue work doing complete writeback for each
flusher thread. Thus there is not much point in submitting another work doing
full inode WB_SYNC_NONE writeback by writeback_inodes_sb().

After this change it does not make sense to call nonblocking ->sync_fs and
block device flush before calling sync_inodes_sb() because
wakeup_flusher_threads() is completely asynchronous and thus these functions
would be called in parallel with inode writeback running which will effectively
void any work they do. So we move sync_inodes_sb() call before these two
functions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d0e91b13 03-Jul-2012 Jan Kara <jack@suse.cz>

vfs: Remove unnecessary flushing of block devices

It is not necessary to write block devices twice. The reason why we first did
flush and then proper sync is that
for_each_bdev() {
write_bdev()
wait_for_completion()
}
is much slower than
for_each_bdev()
write_bdev()
for_each_bdev()
wait_for_completion()
when there is bigger amount of data. But as is seen in the above, there's no real
need to scan pages and submit them twice. We just need to separate the submission
and waiting part. This patch does that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a8c7176b 03-Jul-2012 Jan Kara <jack@suse.cz>

vfs: Make sys_sync writeout also block device inodes

In case block device does not have filesystem mounted on it, sys_sync will just
ignore it and doesn't writeout its dirty pages. This is because writeback code
avoids writing inodes from superblock without backing device and
blockdev_superblock is such a superblock. Since it's unexpected that sync
doesn't writeout dirty data for block devices be nice to users and change the
behavior to do so. So now we iterate over all block devices on blockdev_super
instead of iterating over all superblocks when syncing block devices.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b3de6531 03-Jul-2012 Jan Kara <jack@suse.cz>

vfs: Reorder operations during sys_sync

Change the order of operations during sync from

for_each_sb {
writeback_inodes_sb();
sync_fs(nowait);
__sync_blockdev(nowait);
}
for_each_sb {
sync_inodes_sb();
sync_fs(wait);
__sync_blockdev(wait);
}

to

for_each_sb
writeback_inodes_sb();
for_each_sb
sync_fs(nowait);
for_each_sb
__sync_blockdev(nowait);
for_each_sb
sync_inodes_sb();
for_each_sb
sync_fs(wait);
for_each_sb
__sync_blockdev(wait);

This is a preparation for the following patches in this series.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a1177825 03-Jul-2012 Jan Kara <jack@suse.cz>

quota: Move quota syncing to ->sync_fs method

Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).

So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ceed1723 03-Jul-2012 Jan Kara <jack@suse.cz>

quota: Split dquot_quota_sync() to writeback and cache flushing part

Split off part of dquot_quota_sync() which writes dquots into a quota file
to a separate function. In the next patch we will use the function from
filesystems and we do not want to abuse ->quota_sync quotactl callback more
than necessary.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6eedc701 03-Jul-2012 Jan Kara <jack@suse.cz>

vfs: Move noop_backing_dev_info check from sync into writeback

In principle, a filesystem may want to have ->sync_fs() called during sync(1)
although it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
Only writeback code really needs bdi set to something reasonable. So move the
checks where they are more logical.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 331cbdee 08-Jun-2012 Wanpeng Li <liwp@linux.vnet.ibm.com>

writeback: Fix some comment errors

Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>


# c2bd6c11 19-Apr-2012 Al Viro <viro@zeniv.linux.org.uk>

switch do_fsync() to fget_light()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 630d9c47 16-Nov-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

fs: reduce the use of module.h wherever possible

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# ff01bb48 16-Sep-2011 Al Viro <viro@zeniv.linux.org.uk>

fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0e175a18 07-Oct-2011 Curt Wohlgemuth <curtw@google.com>

writeback: Add a 'reason' to wb_writeback_work

This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>


# 02c24a82 16-Jul-2011 Josef Bacik <josef@redhat.com>

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b7ed78f5 10-Mar-2011 Sage Weil <sage@newdream.net>

introduce sys_syncfs to sync a single file system

It is frequently useful to sync a single file system, instead of all
mounted file systems via sync(2):

- On machines with many mounts, it is not at all uncommon for some of
them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
those and may never get to the one you do care about (e.g., /).
- Some applications write lots of data to the file system and then
want to make sure it is flushed to disk. Calling fsync(2) on each
file introduces unnecessary ordering constraints that result in a large
amount of sub-optimal writeback/flush/commit behavior by the file
system.

There are currently two ways (that I know of) to sync a single super_block:

- BLKFLSBUF ioctl on the block device: That also invalidates the bdev
mapping, which isn't usually desirable, and doesn't work for non-block
file systems.
- 'mount -o remount,rw' will call sync_filesystem as an artifact of the
current implemention. Relying on this little-known side effect for
something like data safety sounds foolish.

Both of these approaches require root privileges, which some applications
do not have (nor should they need?) given that sync(2) is an unprivileged
operation.

This patch introduces a new system call syncfs(2) that takes an fd and
syncs only the file system it references. Maybe someday we can

$ sync /some/path

and not get

sync: ignoring all arguments

The syscall is motivated by comments by Al and Christoph at the last LSF.
syncfs(2) seems like an appropriate name given statfs(2).

A similar ioctl was also proposed a while back, see
http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 95f28604 17-Mar-2011 Jens Axboe <jaxboe@fusionio.com>

fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away

We don't have proper reference counting for this yet, so we run into
cases where the device is pulled and we OOPS on flushing the fs data.
This happens even though the dirty inodes have already been
migrated to the default_backing_dev_info.

Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# b5fc510c 03-Jul-2010 Al Viro <viro@zeniv.linux.org.uk>

get rid of file_fsync()

Copy and simplify in the only two users remaining.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0e3c9a22 01-Jun-2010 Jens Axboe <jaxboe@fusionio.com>

Revert "writeback: fix WB_SYNC_NONE writeback from umount"

This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861.

We are investigating a hang associated with the WB_SYNC_NONE changes,
so revert them for now.

Conflicts:

fs/fs-writeback.c
mm/page-writeback.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 7ea80859 26-May-2010 Christoph Hellwig <hch@lst.de>

drop unused dentry argument to ->fsync

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8018ab05 22-Mar-2010 Christoph Hellwig <hch@lst.de>

sanitize vfs_fsync calling conventions

Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.

The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 01a05b33 23-Mar-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: iterate_supers()

... and switch the simple "loop over superblocks and do something"
loops to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8edd64bd 22-Mar-2010 Al Viro <viro@zeniv.linux.org.uk>

get rid of restarts in sync_filesystems()

At the same time we can kill s_need_restart and local mutex in there.
__put_super() made public for a while; will be gone later.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 551de6f3 22-Mar-2010 Al Viro <viro@zeniv.linux.org.uk>

Leave superblocks on s_list until the end

We used to remove from s_list and s_instances at the same
time. So let's *not* do the former and skip superblocks
that have empty s_instances in the loops over s_list.

The next step, of course, will be to get rid of rescan logics
in those loops.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e913fc82 16-May-2010 Jens Axboe <jens.axboe@oracle.com>

writeback: fix WB_SYNC_NONE writeback from umount

When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
writeback to kick off writeback of pending dirty inodes, then follow
that up with a WB_SYNC_ALL to wait for it. Since umount already holds
the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
since WB_SYNC_ALL writeback is a data integrity operation and thus
a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
it's a lot slower.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 5129a469 25-Apr-2010 Jörn Engel <joern@logfs.org>

Catch filesystems lacking s_bdi

noop_backing_dev_info is used only as a flag to mark filesystems that
don't have any backing store, like tmpfs, procfs, spufs, etc.

Signed-off-by: Joern Engel <joern@logfs.org>

Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
to the noop_backing_dev_info is not legal and will not result in
them being flushed, but we already catch this condition in
__mark_inode_dirty() when checking for a registered bdi.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 5fb324ad 16-Feb-2010 Christoph Hellwig <hch@infradead.org>

quota: move code from sync_quota_sb into vfs_quota_sync

Currenly sync_quota_sb does a lot of sync and truncate action that only
applies to "VFS" style quotas and is actively harmful for the sync
performance in XFS. Move it into vfs_quota_sync and add a wait parameter
to ->quota_sync to tell if we need it or not.

My audit of the GFS2 code says it's also not needed given the way GFS2
implements quotas, but I'd be happy if this can get a detailed review.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 7a0ad10c 17-Dec-2009 Christoph Hellwig <hch@lst.de>

fold do_sync_file_range into sys_sync_file_range

We recently go rid of all callers of do_sync_file_range as they're better
served with vfs_fsync or the filemap_write_and_wait. Now that
do_sync_file_range is down to a single caller fold it into it so that people
don't start using it again accidentally. While at it also switch it from
using __filemap_fdatawrite_range(..., WB_SYNC_ALL) to the more clear
filemap_fdatawrite_range().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 94004ed7 30-Sep-2009 Christoph Hellwig <hch@lst.de>

kill wait_on_page_writeback_range

All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 6b2f3d1f 27-Oct-2009 Christoph Hellwig <hch@lst.de>

vfs: Implement proper O_SYNC semantics

While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment. This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.

This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag. To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.

This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition. Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set. The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.

We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.

Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op. We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>


# 1fe72eaa 22-Sep-2009 H Hartley Sweeten <hartleys@visionengravers.com>

fs/buffer.c: clean up EXPORT* macros

According to Documentation/CodingStyle the EXPORT* macro should follow
immediately after the closing function brace line.

Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
elsewhere so they should be marked as static.

In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
that file.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32a88aa1 16-Sep-2009 Jens Axboe <jens.axboe@oracle.com>

fs: Assign bdi in super_block

We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().

Note that ->s_bdi assignment is required for proper writeback!

Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 2daea67e 02-Sep-2009 Christoph Hellwig <hch@lst.de>

fsync: wait for data writeout completion before calling ->fsync

Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, the calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete. What sounds
like a clever micro-optimization actually is nast trap for many filesystems.

For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata. For old fashionen filesystems that
instanciate blocks during the actual write and also update the metadata
at that point it opens up a large window were we could expose uninitialized
blocks after a crash. While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.

Here is a small audit of all fsync instances in the tree:

- spufs_mfc_fsync:
- ps3flash_fsync:
- vol_cdev_fsync:
- printer_fsync:
- fb_deferred_io_fsync:
- bad_file_fsync:
- simple_sync_file:

don't care - filesystems/drivers do't use the page cache or are
purely in-memory.

- simple_fsync:
- file_fsync:
- affs_file_fsync:
- fat_file_fsync:
- jfs_fsync:
- ubifs_fsync:
- reiserfs_dir_fsync:
- reiserfs_sync_file:

never touch pagecache themselves. We need to wait before if we do
not want to expose stale data after an allocation.

- afs_fsync:
- fuse_fsync_common:

do the waiting writeback itself in awkward ways, would benefit from
proper semantics

- block_fsync:

Does a filemap_write_and_wait on the block device inode. Because we
now have f_mapping that is the same inode we call it on in vfs_fsync.
So just removing it and letting the VFS do the work in one go would
be an improvement.

- btrfs_sync_file:
- cifs_fsync:
- xfs_file_fsync:

need the wait first and currently do it themselves. would benefit from
doing it outside i_mutex.

- coda_fsync:
- ecryptfs_fsync:
- exofs_file_fsync:
- shm_fsync:

only passes the fsync through to the lower layer

- ext3_sync_file:

doesn't seem to care, comments are confusing.

- ext4_sync_file:

would need the wait to work correctly for delalloc mode with late
i_size updates. Otherwise the ext3 comment applies.

currently implemens it's own writeback and wait in an odd way,
could benefit from doing it properly.

- gfs2_fsync:

not needed for journaled data mode, but probably harmless there.
Currently writes back data asynchronously itself. Needs some
major audit.

- hostfs_fsync:

just calls fsync/datasync on the host FD. Without the wait before
data might not even be inflight yet if we're unlucky.

- hpfs_file_fsync:
- ncp_fsync:

no-ops. Dangerous before and after.

- jffs2_fsync:

just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

- nfs_fsync_dir:

just increments stats, claims all directory operations are synchronous

- nfs_file_fsync:

only writes out data??? Looks very odd.

- nilfs_sync_file:

looks like it expects all data done, but not sure from the code

- ntfs_dir_fsync:
- ntfs_file_fsync:

appear to do their own data writeback. Very convoluted code.

- ocfs2_sync_file:

does it's own data writeback, but no wait. probably needs the wait.

- smb_fsync:

according to a comment expects all pages written already, probably needs
the wait before.

This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers. Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 148f948b 17-Aug-2009 Jan Kara <jack@suse.cz>

vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode

Introduce new function for generic inode syncing (vfs_fsync_range) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 03ba3782 09-Sep-2009 Jens Axboe <jens.axboe@oracle.com>

writeback: switch to per-bdi threads for flushing data

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# d8a8559c 01-Sep-2009 Jens Axboe <jens.axboe@oracle.com>

writeback: get rid of generic_sync_sb_inodes() export

This adds two new exported functions:

- writeback_inodes_sb(), which only attempts to writeback dirty inodes on
this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
and also waits for the IO to complete.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 3beab0b4 05-Jul-2009 Zhang, Yanmin <yanmin_zhang@linux.intel.com>

sys_sync(): fix 16% performance regression in ffsb create_4k test

I run many ffsb test cases on JBODs (typically 13/12 disks). Comparing
with kernel 2.6.30, 2.6.31-rc1 has about 16% regression with
ffsb_create_4k. The sub test case creates files continuously for 10
minitues and every file is 1MB.

Bisect located below patch.

5cee5815d1564bbbd505fea86f4550f1efdb5cd0 is first bad commit
commit 5cee5815d1564bbbd505fea86f4550f1efdb5cd0
Author: Jan Kara <jack@suse.cz>
Date: Mon Apr 27 16:43:51 2009 +0200

vfs: Make sys_sync() use fsync_super() (version 4)

It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

As a matter of fact, ffsb calls sys_sync in the end to make sure all data
is flushed to disks and the flushing is counted into the result. vmstat
shows ffsb is blocked when syncing for a long time. With 2.6.30, ffsb is
blocked for a short time.

I checked the patch and did experiments to recover the original methods.
Eventually, the root cause is the patch deletes the calling to
wakeup_pdflush when syncing, so only ffsb is blocked on disk I/O.
wakeup_pdflush could ask pdflush to write back pages with ffsb at the
same time.

[akpm@linux-foundation.org: restore comment too]
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0c95ee19 08-Jun-2009 Christoph Hellwig <hch@lst.de>

remove the call to ->write_super in __sync_filesystem

Now that all filesystems provide ->sync_fs methods we can change
__sync_filesystem to only call ->sync_fs.

This gives us a clear separation between periodic writeouts which
are driven by ->write_super and data integrity syncs that go
through ->sync_fs. (modulo file_fsync which is also going away)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ebc1ac16 11-May-2009 Christoph Hellwig <hch@lst.de>

->write_super lock_super pushdown

Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

* bfs, nilfs2 - no other uses of s_lock and have internal locks in
->write_super
* ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
* reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
->write_super
* xfs - no other uses of s_lock and uses internal lock (buffer lock on
superblock buffer) to serialize ->write_super. Also xfs_fs_write_super
is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5af7926f 05-May-2009 Christoph Hellwig <hch@lst.de>

enforce ->sync_fs is only called for rw superblock

Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount. sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there. cachefiles grew
s_umount locking.

I've also added a WARN_ON to sync_filesystem to assert this for
future callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c3f8a40c 27-Apr-2009 Jan Kara <jack@suse.cz>

quota: Introduce writeout_quota_sb() (version 4)

Introduce this function which just writes all the quota structures but
avoids all the syncing and cache pruning work to expose quota structures
to userspace. Use this function from __sync_filesystem when wait == 0.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 850b201b 27-Apr-2009 Christoph Hellwig <hch@lst.de>

quota: cleanup dquota sync functions (version 4)

Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given
superblock. This is a small wrapper around sync_dquots which for the
case of a non-NULL superblock is a small wrapper around quota_sync_sb.

Just make quota_sync_sb global (rename it to sync_quota_sb) and call it
directly. Also call it directly for those cases in quota.c that have a
superblock and leave sync_dquots purely an iterator over sync_quota_sb and
remove it's superblock argument.

To make this nicer move the check for the lack of a quota_sync method
from the callers into sync_quota_sb.

[folded build fix from Alexander Beregalov <a.beregalov@gmail.com>]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 60b0680f 27-Apr-2009 Jan Kara <jack@suse.cz>

vfs: Rename fsync_super() to sync_filesystem() (version 4)

Rename the function so that it better describe what it really does. Also
remove the unnecessary include of buffer_head.h.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c15c54f5 27-Apr-2009 Jan Kara <jack@suse.cz>

vfs: Move syncing code from super.c to sync.c (version 4)

Move sync_filesystems(), __fsync_super(), fsync_super() from
super.c to sync.c where it fits better.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5cee5815 27-Apr-2009 Jan Kara <jack@suse.cz>

vfs: Make sys_sync() use fsync_super() (version 4)

It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

Nice bonus is that we get a complete livelock avoidance and write_supers()
is now only used for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5a3e5cb8 27-Apr-2009 Jan Kara <jack@suse.cz>

vfs: Fix sys_sync() and fsync_super() reliability (version 4)

So far, do_sync() called:
sync_inodes(0);
sync_supers();
sync_filesystems(0);
sync_filesystems(1);
sync_inodes(1);

This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem hits also other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).

Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.

The same issue is fixed in __fsync_super() function used on umount /
remount read-only.

[AV: build fixes]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a2a9537a 17-Mar-2009 Jens Axboe <jens.axboe@oracle.com>

Get rid of pdflush_operation() in emergency sync and remount

Opencode a cheasy approach with kevent. The idea here is that we'll
add some generic delayed work infrastructure, which probably wont be
based on pdflush (or maybe it will, in which case we can just add it
back).

This is in preparation for getting rid of pdflush completely.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9e3509e2 26-Jan-2009 Jan Kara <jack@suse.cz>

vfs: Use lowercase names of quota functions

Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>


# a5f8fa9e 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 09

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 6673e0c3 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrapper special cases

System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.

So we handle them as special case and add some extra wrappers instead.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# ee53a891 06-Jan-2009 Nick Piggin <npiggin@suse.de>

mm: do_sync_mapping_range integrity fix

Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout. Unfortunately, it is advertised as being usable for
data integrity operations.

This is a data integrity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4c728ef5 22-Dec-2008 Christoph Hellwig <hch@lst.de>

add a vfs_fsync helper

Fsync currently has a fdatawrite/fdatawait pair around the method call,
and a mutex_lock/unlock of the inode mutex. All callers of fsync have
to duplicate this, but we have a few and most of them don't quite get
it right. This patch adds a new vfs_fsync that takes care of this.
It's a little more complicated as usual as ->fsync might get a NULL file
pointer and just a dentry from nfsd, but otherwise gets afile and we
want to take the mapping and file operations from it when it is there.

Notes on the fsync callers:

- ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the
lower file
- coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host
file, and returning 0 when ->fsync was missing
- shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor
taking i_mutex. Now given that shared memory doesn't have disk
backing not doing anything in fsync seems fine and I left it out of
the vfs_fsync conversion for now, but in that case we might just
not pass it through to the lower file at all but just call the no-op
simple_sync_file directly.

[and now actually export vfs_fsync]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cce77081 23-Jul-2008 Pavel Machek <pavel@suse.cz>

SYNC_FILE_RANGE_WRITE may and will block. Document that.

[akpm@linux-foundation.org: fix comment text]
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 762873c2 29-Apr-2008 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

vfs: fix unconditional write_super() call in file_fsync()

We need to check ->s_dirt before calling write_super(). It became the cause
of an unneeded write.

This bug was noticed by Sudhanshu Saxena.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# edd5cd4a 27-Jun-2007 David Woodhouse <dwmw2@infradead.org>

Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARM

Not all the world is an i386. Many architectures need 64-bit arguments to be
aligned in suitable pairs of registers, and the original
sys_sync_file_range(int, loff_t, loff_t, int) was therefore wasting an
argument register for padding after the first integer. Since we don't
normally have more than 6 arguments for system calls, that left no room for
the final argument on some architectures.

Fix this by introducing sys_sync_file_range2(int, int, loff_t, loff_t) which
all fits nicely. In fact, ARM already had that, but called it
sys_arm_sync_file_range. Move it to fs/sync.c and rename it, then implement
the needed compatibility routine. And stop the missing syscall check from
bitching about the absence of sys_sync_file_range() if we've implemented
sys_sync_file_range2() instead.

Tested on PPC32 and with 32-bit and 64-bit userspace on PPC64.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ef51c976 08-May-2007 Mark Fasheh <mark.fasheh@oracle.com>

Remove do_sync_file_range()

Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5b04aa3a 01-Mar-2007 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Turn do_sync_file_range() into do_sync_mapping_range()

do_sync_file_range() accepts a file * from which it takes an address_space to
sync. Abstract out the bulk of the function into do_sync_mapping_range()
which takes the address_space directly. This way callers who want to sync an
address_space directly can take advantage of the functionality provided.

do_sync_file_range() is preserved as a small wrapper around
do_sync_mapping_range().

Ocfs2 in particular would like to use this to initiate a sync of a specific
inode range during truncate, where a file * may not be available.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0f7fc9e4 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] VFS: change struct file to use struct path

This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 914e2637 18-Oct-2006 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] severing fs.h, radix-tree.h -> sched.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cf9a2ae8 29-Aug-2006 David Howells <dhowells@redhat.com>

[PATCH] BLOCK: Move functions out of buffer code [try #6]

Move some functions out of the buffering code that aren't strictly buffering
specific. This is a precursor to being able to disable the block layer.

(*) Moved some stuff out of fs/buffer.c:

(*) The file sync and general sync stuff moved to fs/sync.c.

(*) The superblock sync stuff moved to fs/super.c.

(*) do_invalidatepage() moved to mm/truncate.c.

(*) try_to_release_page() moved to mm/filemap.c.

(*) Moved some related declarations between header files:

(*) declarations for do_invalidatepage() and try_to_release_page() moved
to linux/mm.h.

(*) __set_page_dirty_buffers() moved to linux/buffer_head.h.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 111ebb6e 23-Jun-2006 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

[PATCH] writeback: fix range handling

When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".

To make all this sane, the patch changes range of writeback_control.

So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.

And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.

This patch does,

- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,

range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"

or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.

- All callers of ->writepages() sets range_start/end or range_cyclic.

- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5246d050 10-Apr-2006 Andrew Morton <akpm@osdl.org>

[PATCH] sync_file_range(): use unsigned for flags

Ulrich suggested that the `flags' arg to sync_file_range() become unsigned.

Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# f79e2abb 31-Mar-2006 Andrew Morton <akpm@osdl.org>

[PATCH] sys_sync_file_range()

Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
Reasons:

- It's more flexible. Things which would require two or three syscalls with
fadvise() can be done in a single syscall.

- Using fadvise() in this manner is something not covered by POSIX.

The patch wires up the syscall for x86.

The sycall is implemented in the new fs/sync.c. The intention is that we can
move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

Documentation for the syscall is in fs/sync.c.

A test app (sync_file_range.c) is in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
NFS_DATA_SYNC which is hopefully the more common."

Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
the queue is congested. This is trivial to fix: add a new flag bit, set
wbc->nonblocking. But I'm not sure that we want to expose implementation
details down to that level.

Note: it's notable that we can sync an fd which wasn't opened for writing.
Same with fsync() and fdatasync()).

Note: the code takes some care to handle attempts to sync file contents
outside the 16TB offset on 32-bit machines. It makes such attempts appear to
succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
requests fail...

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>