History log of /linux-master/fs/file_table.c
Revision Date Author Comments
# bac0a9e5 08-Feb-2024 Christian Brauner <brauner@kernel.org>

file: add alloc_file_pseudo_noaccount()

When we open block devices as files we want to make sure to not charge
them against the open file limit of the caller as that can cause
spurious failures.

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0873add0 08-Feb-2024 Christian Brauner <brauner@kernel.org>

file: prepare for new helper

In order to add a helper to open files that aren't accounted split
alloc_file() and parts of alloc_file_pseudo() into helpers. One to
prepare a path, another one to setup the file.

Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240129160241.GA2793@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>


# cd3cec0a 15-Feb-2024 Roberto Sassu <roberto.sassu@huawei.com>

ima: Move to LSM infrastructure

Move hardcoded IMA function calls (not appraisal-specific functions) from
various places in the kernel to the LSM infrastructure, by introducing a
new LSM named 'ima' (at the end of the LSM list and always enabled like
'integrity').

Having IMA before EVM in the Makefile is sufficient to preserve the
relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM,
and thus the order of IMA and EVM function calls as when they were
hardcoded.

Make moved functions as static (except ima_post_key_create_or_update(),
which is not in ima_main.c), and register them as implementation of the
respective hooks in the new function init_ima_lsm().

Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook
path_post_mknod is always available and ima_post_path_mknod() is always
executed to mark files as new, as before the move.

A slight difference is that IMA and EVM functions registered for the
inode_post_setattr, inode_post_removexattr, path_post_mknod,
inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl
won't be executed for private inodes. Since those inodes are supposed to be
fs-internal, they should not be of interest to IMA or EVM. The S_PRIVATE
flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub
and kernel-internal tmpfs files.

Conditionally register ima_post_key_create_or_update() if
CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register
ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled.

Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# f09068b5 15-Feb-2024 Roberto Sassu <roberto.sassu@huawei.com>

security: Introduce file_release hook

In preparation for moving IMA and EVM to the LSM infrastructure, introduce
the file_release hook.

IMA calculates at file close the new digest of the file content and writes
it to security.ima, so that appraisal at next file access succeeds.

The new hook cannot return an error and cannot cause the operation to be
reverted.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 9024b4c9 11-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

d_alloc_pseudo(): move setting ->d_op there from the (sole) caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9d5b9475 20-Nov-2023 Joel Granados <j.granados@samsung.com>

fs: Remove the now superfluous sentinel elements from ctl_table array

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel elements ctl_table struct. Special attention was placed in
making sure that an empty directory for fs/verity was created when
CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the
register sysctl call that expects a size.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 372a34e6 30-Nov-2023 Christian Brauner <brauner@kernel.org>

fs: replace f_rcuhead with f_task_work

The naming is actively misleading since we switched to
SLAB_TYPESAFE_BY_RCU. rcu_head is #define callback_head. Use
callback_head directly and rename f_rcuhead to f_task_work.

Add comments in there to explain what it's used for.

Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-3-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 7cb537b6 25-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

file: massage cleanup of files that failed to open

A file that has never gotten FMODE_OPENED will never have RCU-accessed
references, its final fput() is equivalent to file_free() and if it
doesn't have FMODE_BACKING either, it can be done from any context and
won't need task_work treatment.

Now that we have SLAB_TYPESAFE_BY_RCU we can simplify this and have
other callers benefit. All of that can be achieved easier is to make
fput() recoginze that case and call file_free() directly.

No need to introduce a special primitive for that. It also allowed
things like failing dentry_open() could benefit from that as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[Christian Brauner <brauner@kernel.org>: massage commit message]
Link: https://lore.kernel.org/r/20231126020834.GC38156@ZenIV
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 68279f9c 11-Oct-2023 Alexey Dobriyan <adobriyan@gmail.com>

treewide: mark stuff as __ro_after_init

__read_mostly predates __ro_after_init. Many variables which are marked
__read_mostly should have been __ro_after_init from day 1.

Also, mark some stuff as "const" and "__init" while I'm at it.

[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# def3ae83 09-Oct-2023 Amir Goldstein <amir73il@gmail.com>

fs: store real path instead of fake path in backing file f_path

A backing file struct stores two path's, one "real" path that is referring
to f_inode and one "fake" path, which should be displayed to users in
/proc/<pid>/maps.

There is a lot more potential code that needs to know the "real" path, then
code that needs to know the "fake" path.

Instead of code having to request the "real" path with file_real_path(),
store the "real" path in f_path and require code that needs to know the
"fake" path request it with file_user_path().
Replace the file_real_path() helper with a simple const accessor f_path().

After this change, file_dentry() is not expected to observe any files
with overlayfs f_path and real f_inode, so the call to ->d_real() should
not be needed. Leave the ->d_real() call for now and add an assertion
in ovl_d_real() to catch if we made wrong assumptions.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 6cf41fcf 05-Oct-2023 Christian Brauner <brauner@kernel.org>

backing file: free directly

Backing files as used by overlayfs are never installed into file
descriptor tables and are explicitly documented as such. They aren't
subject to rcu access conditions like regular files are.

Their lifetime is bound to the lifetime of the overlayfs file, i.e.,
they're stashed in ovl_file->private_data and go away otherwise. If
they're set as vma->vm_file - which is their main purpose - then they're
subject to regular refcount rules and vma->vm_file can't be installed
into an fdtable after having been set. All in all I don't see any need
for rcu delay here. So free it directly.

This all hinges on such hybrid beasts to never actually be installed
into fdtables which - as mentioned before - is not allowed. So add an
explicit WARN_ON_ONCE() so we catch any case where someone is suddenly
trying to install one of those things into a file descriptor table so we
can have a nice long chat with them.

Link: https://lore.kernel.org/r/20231005-sakralbau-wappnen-f5c31755ed70@brauner
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0ede61d8 29-Sep-2023 Christian Brauner <brauner@kernel.org>

file: convert to SLAB_TYPESAFE_BY_RCU

In recent discussions around some performance improvements in the file
handling area we discussed switching the file cache to rely on
SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
freeing for files completely. This is a pretty sensitive change overall
but it might actually be worth doing.

The main downside is the subtlety. The other one is that we should
really wait for Jann's patch to land that enables KASAN to handle
SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
exists.

With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
which requires a few changes. So it isn't sufficient anymore to just
acquire a reference to the file in question under rcu using
atomic_long_inc_not_zero() since the file might have already been
recycled and someone else might have bumped the reference.

In other words, callers might see reference count bumps from newer
users. For this reason it is necessary to verify that the pointer is the
same before and after the reference count increment. This pattern can be
seen in get_file_rcu() and __files_get_rcu().

In addition, it isn't possible to access or check fields in struct file
without first aqcuiring a reference on it. Not doing that was always
very dodgy and it was only usable for non-pointer data in struct file.
With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
reference under rcu or they must hold the files_lock of the fdtable.
Failing to do either one of this is a bug.

Thanks to Jann for pointing out that we need to ensure memory ordering
between reallocations and pointer check by ensuring that all subsequent
loads have a dependency on the second load in get_file_rcu() and
providing a fixup that was folded into this patch.

Cc: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 93faf426 26-Sep-2023 Mateusz Guzik <mjguzik@gmail.com>

vfs: shave work on failed file open

Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing kernel build for few seconds (strace -fc make):

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
0.76 0.076233 5 15040 3688 openat

(this is tons of header files tried in different paths)

In the common case of there being nothing to close (only the file object
to free) there is a lot of overhead which can be avoided.

This is most notably delegation of freeing to task_work, which comes
with an enormous cost (see 021a160abf62 ("fs: use __fput_sync in
close(2)" for an example).

Benchmarked with will-it-scale with a custom testcase based on
tests/open1.c, stuffed into tests/openneg.c:
[snip]
while (1) {
int fd = open("/tmp/nonexistent", O_RDONLY);
assert(fd == -1);

(*iterations)++;
}
[/snip]

Sapphire Rapids, openneg_processes -t 1 (ops/s):
before: 1950013
after: 2914973 (+49%)

file refcount is checked as a safety belt against buggy consumers with
an atomic cmpxchg. Technically it is not necessary, but it happens to
not be measurable due to several other atomics which immediately follow.
Optmizing them away to make this atomic into a problem is left as an
exercise for the reader.

v2:
- unexport fput_badopen and move to fs/internal.h
- handle the refcount with cmpxchg, adjust commentary accordingly
- tweak the commit message

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 021a160a 08-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

fs: use __fput_sync in close(2)

close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.

Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).

Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)

Patched result is the same regardless of rseq as the codepath is avoided.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# dff745c1 01-Jul-2023 Amir Goldstein <amir73il@gmail.com>

fs: move cleanup from init_file() into its callers

The use of file_free_rcu() in init_file() to free the struct that was
allocated by the caller was hacky and we got what we deserved.

Let init_file() and its callers take care of cleaning up each after
their own allocated resources on error.

Fixes: 62d53c4a1dfe ("fs: use backing_file container for internal files with "fake" f_path") # mainline only
Reported-and-tested-by: syzbot+ada42aab05cf51b00e98@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230701171134.239409-1-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 62d53c4a 15-Jun-2023 Amir Goldstein <amir73il@gmail.com>

fs: use backing_file container for internal files with "fake" f_path

Overlayfs uses open_with_fake_path() to allocate internal kernel files,
with a "fake" path - whose f_path is not on the same fs as f_inode.

Allocate a container struct backing_file for those internal files, that
is used to hold the "fake" ovl path along with the real path.

backing_file_real_path() can be used to access the stored real path.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230615112229.2143178-5-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 8a05a8c3 15-Jun-2023 Amir Goldstein <amir73il@gmail.com>

fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers

Use a common helper init_file() instead of __alloc_file() for
alloc_empty_file*() helpers and improrve the documentation.

This is needed for a follow up patch that allocates a backing_file
container.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-4-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 5970e15d 20-Nov-2022 Jeff Layton <jlayton@kernel.org>

filelock: move file locking definitions to separate header file

The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.

Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>


# d6da19c9 16-Aug-2022 Amir Goldstein <amir73il@gmail.com>

locks: fix TOCTOU race when granting write lease

Thread A trying to acquire a write lease checks the value of i_readcount
and i_writecount in check_conflicting_open() to verify that its own fd
is the only fd referencing the file.

Thread B trying to open the file for read will call break_lease() in
do_dentry_open() before incrementing i_readcount, which leaves a small
window where thread A can acquire the write lease and then thread B
completes the open of the file for read without breaking the write lease
that was acquired by thread A.

Fix this race by incrementing i_readcount before checking for existing
leases, same as the case with i_writecount.

Use a helper put_file_access() to decrement i_readcount or i_writecount
in do_dentry_open() and __fput().

Fixes: 387e3746d01c ("locks: eliminate false positive conflicts for write lease")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 868941b1 29-Jun-2022 Jason A. Donenfeld <Jason@zx2c4.com>

fs: remove no_llseek

Now that all callers of ->llseek are going through vfs_llseek(), we
don't gain anything by keeping no_llseek around. Nothing actually calls
it and setting ->llseek to no_lseek is completely equivalent to
leaving it NULL.

Longer term (== by the end of merge window) we want to remove all such
intializations. To simplify the merge window this commit does *not*
touch initializers - it only defines no_llseek as NULL (and simplifies
the tests on file opening).

At -rc1 we'll need do a mechanical removal of no_llseek -

git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7478158 29-Jun-2022 Jason A. Donenfeld <Jason@zx2c4.com>

fs: clear or set FMODE_LSEEK based on llseek function

Pipe-like behaviour on llseek(2) (i.e. unconditionally failing with
-ESPIPE) can be expresses in 3 ways:
1) ->llseek set to NULL in file_operations
2) ->llseek set to no_llseek in file_operations
3) FMODE_LSEEK *not* set in ->f_mode.

Enforce (3) in cases (1) and (2); that will allow to simplify the
checks and eventually get rid of no_llseek boilerplate.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 164f4064 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

keep iocb_flags() result cached in struct file

* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal
opens, alloc_file() for pipe()/socket()/etc.)
* update when handling F_SETFL
* keep in a new field - file->f_iocb_flags; since that thing is needed only
before the refcount reaches zero, we can put it into the same anon union
where ->f_rcuhead and ->f_llist live - those are used only after refcount
reaches zero.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e87f2c26 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

struct file: use anonymous union member for rcuhead and llist

Once upon a time we couldn't afford anon unions; these days minimal
gcc version had been raised enough to take care of that.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 81132a39 01-Nov-2021 Gou Hao <gouhao@uniontech.com>

fs: remove fget_many and fput_many interface

These two interface were added in 091141a42 commit,
but now there is no place to call them.

The only user of fput/fget_many() was removed in commit
62906e89e63b ("io_uring: remove file batch-get optimisation").

A user of get_file_rcu_many() were removed in commit
f073531070d2 ("init: add an init_dup helper").

And replace atomic_long_sub/add to atomic_long_dec/inc
can improve performance.

Here are the test results of unixbench:

Cmd: ./Run -c 64 context1

Without patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 2798407.0 6996.0
========
System Benchmarks Index Score (Partial Only) 6996.0

With patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 3486268.8 8715.7
========
System Benchmarks Index Score (Partial Only) 8715.7

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f0043206 03-Apr-2022 Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()

We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.

Reported-by: Felix Fu <foyjog@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect
Cc: stable@vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable@vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>


# a3580ac9 14-Feb-2022 Luis Chamberlain <mcgrof@kernel.org>

fs/file_table: fix adding missing kmemleak_not_leak()

Commit b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl
to its own file") fixed a regression, however it failed to add a
kmemleak_not_leak().

Fixes: b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl to its own file")
Reported-by: Tong Zhang <ztong0001@gmail.com>
Cc: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b42bc9a3 09-Feb-2022 Domenico Andreoli <domenico.andreoli@linux.com>

Fix regression due to "fs: move binfmt_misc sysctl to its own file"

Commit 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file") did
not go unnoticed, binfmt-support stopped to work on my Debian system
since v5.17-rc2 (did not check with -rc1).

The existance of the /proc/sys/fs/binfmt_misc is a precondition for
attempting to mount the binfmt_misc fs, which in turn triggers the
autoload of the binfmt_misc module. Without it, no module is loaded and
no binfmt is available at boot.

Building as built-in or manually loading the module and mounting the fs
works fine, it's therefore only a matter of interaction with user-space.
I could try to improve the Debian systemd configuration but I can't say
anything about the other distributions.

This patch restores a working system right after boot.

Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 204d5a24 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move fs stat sysctls to file_table.c

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

We can create the sysctl dynamically on early init for fs stat to help
with this clutter. This dusts off the fs stat syctls knobs and puts
them into where they are declared.

Link: https://lkml.kernel.org/r/20211129205548.605569-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 319c1517 01-Oct-2020 Al Viro <viro@zeniv.linux.org.uk>

epoll: take epitem list out of struct file

Move the head of epitem list out of struct file; for epoll ones it's
moved into struct eventpoll (->refs there), for non-epoll - into
the new object (struct epitem_head). In place of ->f_ep_links we
leave a pointer to the list head (->f_ep).

->f_ep is protected by ->f_lock and it's zeroed as soon as the list
of epitems becomes empty (that can happen only in ep_remove() by
now).

The list of files for reverse path check is *not* going through
struct file now - it's a single-linked list going through epitem_head
instances. It's terminated by ERR_PTR(-1) (== EP_UNACTIVE_POINTER),
so the elements of list can be distinguished by head->next != NULL.

epitem_head instances are allocated at ep_insert() time (by
attach_epitem()) and freed either by ep_remove() (if it empties
the set of epitems *and* epitem_head does not belong to the
reverse path check list) or by clear_tfile_check_list() when
the list is emptied (if the set of epitems is empty by that
point). Allocations are done from a separate slab - minimal kmalloc()
size is too large on some architectures.

As the result, we trim struct file _and_ get rid of the games with
temporary file references.

Locking and barriers are interesting (aren't they always); see unlist_file()
and ep_remove() for details. The non-obvious part is that ep_remove() needs
to decide if it will be the one to free the damn thing *before* actually
storing NULL to head->epitems.first - that's what smp_load_acquire is for
in there. unlist_file() lockless path is safe, since we hit it only if
we observe NULL in head->epitems.first and whoever had done that store is
guaranteed to have observed non-NULL in head->next. IOW, their last access
had been the store of NULL into ->epitems.first and we can safely free
the sucker. OTOH, we are under rcu_read_lock() and both epitem and
epitem->file have their freeing RCU-delayed. So if we see non-NULL
->epitems.first, we can grab ->f_lock (all epitems in there share the
same struct file) and safely recheck the emptiness of ->epitems; again,
->next is still non-NULL, so ep_remove() couldn't have freed head yet.
->f_lock serializes us wrt ep_remove(); the rest is trivial.

Note that once head->epitems becomes NULL, nothing can get inserted into
it - the only remaining reference to head after that point is from the
reverse path check list.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 91989c70 16-Oct-2020 Jens Axboe <axboe@kernel.dk>

task_work: cleanup notification modes

A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.

Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b6509f6a 29-Jun-2020 Mel Gorman <mgorman@techsingularity.net>

Revert "fs: Do not check if there is a fsnotify watcher on pseudo inodes"

This reverts commit e9c15badbb7b ("fs: Do not check if there is a
fsnotify watcher on pseudo inodes"). The commit intended to eliminate
fsnotify-related overhead for pseudo inodes but it is broken in
concept. inotify can receive events of pipe files under /proc/X/fd and
chromium relies on close and open events for sandboxing. Maxim Levitsky
reported the following

Chromium starts as a white rectangle, shows few white rectangles that
resemble its notifications and then crashes.

The stdout output from chromium:

[mlevitsk@starship ~]$chromium-freeworld
mesa: for the --simplifycfg-sink-common option: may only occur zero or one times!
mesa: for the --global-isel-abort option: may only occur zero or one times!
[3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init()
../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0072
Received signal 11 SEGV_MAPERR 0000004a9048

Crashes are not universal but even if chromium does not crash, it certainly
does not work properly. While filtering just modify and access might be
safe, the benefit is not worth the risk hence the revert.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: e9c15badbb7b ("fs: Do not check if there is a fsnotify watcher on pseudo inodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e9c15bad 15-Jun-2020 Mel Gorman <mgorman@techsingularity.net>

fs: Do not check if there is a fsnotify watcher on pseudo inodes

The kernel uses internal mounts created by kern_mount() and populated
with files with no lookup path by alloc_file_pseudo() for a variety of
reasons. An example of such a mount is for anonymous pipes. For pipes,
every vfs_write() regardless of filesystem, calls fsnotify_modify()
to notify of any changes which incurs a small amount of overhead in
fsnotify even when there are no watchers. It can also trigger for reads
and readv and writev, it was simply vfs_write() that was noticed first.

A patch is pending that reduces, but does not eliminate, the overhead of
fsnotify but for files that cannot be looked up via a path, even that
small overhead is unnecessary. The user API for all notification
subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd
and proc entries appear to be the only visible representation of the
files. Proc does not have the same pathname as the internal entry and
the proc inode is not the same as the internal inode so even if fanotify
is used on a file under /proc/XX/fd, no useful events are notified.

This patch changes alloc_file_pseudo() to always opt out of fsnotify by
setting FMODE_NONOTIFY flag so that no check is made for fsnotify
watchers on pseudo files. This should be safe as the underlying helper
for the dentry is d_alloc_pseudo() which explicitly states that no
lookups are ever performed meaning that fanotify should have nothing
useful to attach to.

The test motivating this was "perf bench sched messaging --pipe". On
a single-socket machine using threads the difference of the patch was
as follows.

5.7.0 5.7.0
vanilla nofsnotify-v1r1
Amean 1 1.3837 ( 0.00%) 1.3547 ( 2.10%)
Amean 3 3.7360 ( 0.00%) 3.6543 ( 2.19%)
Amean 5 5.8130 ( 0.00%) 5.7233 * 1.54%*
Amean 7 8.1490 ( 0.00%) 7.9730 * 2.16%*
Amean 12 14.6843 ( 0.00%) 14.1820 ( 3.42%)
Amean 18 21.8840 ( 0.00%) 21.7460 ( 0.63%)
Amean 24 28.8697 ( 0.00%) 29.1680 ( -1.03%)
Amean 30 36.0787 ( 0.00%) 35.2640 * 2.26%*
Amean 32 38.0527 ( 0.00%) 38.1223 ( -0.18%)

The difference is small but in some cases it's outside the noise so
while marginal, there is still some small benefit to ignoring fsnotify
for files allocated via alloc_file_pseudo() in some cases.

Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# 735e4ae5 01-Jun-2020 Jeff Layton <jlayton@kernel.org>

vfs: track per-sb writeback errors and report them to syncfs

Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7239a40c 18-Aug-2019 Trond Myklebust <trond.myklebust@primarydata.com>

vfs: Export flush_delayed_fput for use by knfsd.

Allow knfsd to flush the delayed fput list so that it can ensure the
cached struct file is closed before it is unlinked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# a07b2000 05-Nov-2018 Al Viro <viro@zeniv.linux.org.uk>

vfs: syscall: Add open_tree(2) to reference or clone a mount

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within in the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 091141a4 21-Nov-2018 Jens Axboe <axboe@kernel.dk>

fs: add fget_many() and fput_many()

Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ca79b0c2 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: convert totalram_pages and totalhigh_pages variables to atomic

totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3d6357de 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: reference totalram_pages and managed_pages once per function

Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d3b1084d 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: make open_with_fake_path() not contribute to nr_files

Stacking file operations in overlay will store an extra open file for each
overlay file opened.

The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.

For fear of breaking working setups, don't start accounting the extra file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ee1904ba 17-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

make alloc_file() static

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 183266f2 17-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

new helper: alloc_file_clone()

alloc_file_clone(old_file, mode, ops): create a new struct file with
->f_path equal to that of old_file. pipe converted.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d93aa9d8 09-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

new wrapper: alloc_file_pseudo()

takes inode, vfsmount, name, O_... flags and file_operations and
either returns a new struct file (in which case inode reference we
held is consumed) or returns ERR_PTR(), in which case no refcounts
are altered.

converted aio_private_file() and sock_alloc_file() to it

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4d27f326 09-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

fold put_filp() into fput()

Just check FMODE_OPENED in __fput() and be done with that...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f5d11409 09-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

introduce FMODE_OPENED

basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".

NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ea73ea72 11-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

pass ->f_flags value to alloc_empty_file()

... and have it set the f_flags-derived part of ->f_mode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6de37b6d 10-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

pass creds to get_empty_filp(), make sure dentry_open() passes the right creds

... and rename get_empty_filp() to alloc_empty_file().

dentry_open() gets creds as argument, but the only thing that sees those is
security_file_open() - file->f_cred still ends up with current_cred(). For
almost all callers it's the same thing, but there are several broken cases.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c9c554f2 11-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

alloc_file(): switch to passing O_... flags instead of FMODE_... mode

... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e8cff84f 09-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

fold security_file_free() into file_free()

.. and the call of file_free() in case of security_file_alloc() failure
in get_empty_filp() should be simply file_free_rcu() - no point in
rcu-delays there, anyway.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9c565035 17-Nov-2017 Yang Shi <yang.s@alibaba-inc.com>

vfs: remove unused hardirq.h

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by vfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f3f7c093 15-Nov-2017 Shakeel Butt <shakeelb@google.com>

fs, mm: account filp cache to kmemcg

The allocations from filp cache can be directly triggered by userspace
applications. A buggy application can consume a significant amount of
unaccounted system memory. Though we have not noticed such buggy
applications in our production but upon close inspection, we found that
a lot of machines spend very significant amount of memory on these
caches.

One way to limit allocations from filp cache is to set system level
limit of maximum number of open files. However this limit is shared
between different users on the system and one user can hog this
resource. To cater that, we can charge filp to kmemcg and set the
maximum limit very high and let the memory limit of each user limit the
number of files they can open and indirectly limiting their allocations
from filp cache.

One side effect of this change is that it will allow _sysctl() to return
ENOMEM and the man page of _sysctl() does not specify that. However the
man page also discourages to use _sysctl() at all.

Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bb02b186 21-Jun-2017 Mimi Zohar <zohar@linux.vnet.ibm.com>

ima: call ima_file_free() prior to calling fasync

The file hash is calculated and written out as an xattr after
calling fasync(). In order for the file data and metadata to be
written out to disk at the same time, this patch calculates the
file hash and stores it as an xattr before calling fasync.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>


# b9ea557e 07-Aug-2017 Byungchul Park <byungchul.park@lge.com>

fput: Don't reinvent the wheel but use existing llist API

Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5660e13d 06-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>


# 5b825c3a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a4141d7c 20-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

constify alloc_file()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4248b0da 06-Aug-2015 Mel Gorman <mgorman@suse.de>

fs, file table: reinit files_stat.max_files after deferred memory initialisation

Dave Hansen reported the following;

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e5e6e97f 04-Jun-2015 Al Viro <viro@zeniv.linux.org.uk>

remove the pointless include of lglock.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 84363182 03-Apr-2015 Al Viro <viro@zeniv.linux.org.uk>

->aio_read and ->aio_write removed

no remaining users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a457606a 12-Oct-2014 Eric Biggers <ebiggers3@gmail.com>

fs/file_table.c: Update alloc_file() comment

This comment is 5 years outdated; init_file() no longer exists.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 908c7f19 07-Sep-2014 Tejun Heo <tj@kernel.org>

percpu_counter: add @gfp to percpu_counter_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>


# 1f7e0616 06-Jun-2014 Joe Perches <joe@perches.com>

fs: convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 293bc982 11-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

new methods: ->read_iter() and ->write_iter()

Beginning to introduce those. Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7f7f25e8 11-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

replace checking for ->read/->aio_read presence with check in ->f_mode

Since we are about to introduce new methods (read_iter/write_iter), the
tests in a bunch of places would have to grow inconveniently. Check
once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
and FMODE_CAN_WRITE resp. It might end up being a temporary measure -
once everything switches from ->aio_{read,write} to ->{read,write}_iter
it might make sense to return to open-coded checks. We'll see...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7f4b36f9 13-Mar-2014 Al Viro <viro@zeniv.linux.org.uk>

get rid of files_defer_init()

the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 83f936c7 13-Mar-2014 Al Viro <viro@zeniv.linux.org.uk>

mark struct file that had write access grabbed by open()

new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4597e695 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk>

get rid of DEBUG_WRITECOUNT

it only makes control flow in __fput() and friends more convoluted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dd20908a 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk>

don't bother with {get,put}_write_access() on non-regular files

it's pointless and actually leads to wrong behaviour in at least one
moderately convoluted case (pipe(), close one end, try to get to
another via /proc/*/fd and run into ETXTBUSY).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 78ed8a13 02-Feb-2014 Jeff Layton <jlayton@kernel.org>

locks: rename locks_remove_flock to locks_remove_file

This function currently removes leases in addition to flock locks and in
a later patch we'll have it deal with file-private locks too. Rename it
to locks_remove_file to indicate that it removes locks that are
associated with a particular struct file, and not just flock locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>


# 9c225f26 03-Mar-2014 Linus Torvalds <torvalds@linux-foundation.org>

vfs: atomic f_pos accesses as per POSIX

Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

"2.9.7 Thread Interactions with Regular File Operations

All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eee5cc27 04-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of s_files and files_lock

The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72c2d531 22-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

file->f_op is never NULL...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c7314d74 20-Oct-2013 Al Viro <viro@zeniv.linux.org.uk>

nfsd regression since delayed fput()

Background: nfsd v[23] had throughput regression since delayed fput
went in; every read or write ends up doing fput() and we get a pair
of extra context switches out of that (plus quite a bit of work
in queue_work itselfi, apparently). Use of schedule_delayed_work()
gives it a chance to accumulate a bit before we do __fput() on all
of them. I'm not too happy about that solution, but... on at least
one real-world setup it reverts about 10% throughput loss we got from
switch to delayed fput.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be49b30a 11-Sep-2013 Andrew Morton <akpm@linux-foundation.org>

fs/file_table.c:fput(): make comment more truthful

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 184cacab 30-Aug-2013 Al Viro <viro@zeniv.linux.org.uk>

only regular files with FMODE_WRITE need to be on s_files

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4f5e65a1 08-Jul-2013 Oleg Nesterov <oleg@redhat.com>

fput: turn "list_head delayed_fput_list" into llist_head

fput() and delayed_fput() can use llist and avoid the locking.

This is unlikely path, it is not that this change can improve
the performance, but this way the code looks simpler.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 64372501 08-Jul-2013 Andrew Morton <akpm@linux-foundation.org>

fs/file_table.c:fput(): add comment

A missed update to "fput: task_work_add() can fail if the caller has
passed exit_task_work()".

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c77cecee 13-Jun-2013 David Howells <dhowells@redhat.com>

Replace a bunch of file->dentry->d_inode refs with file_inode()

Replace a bunch of file->dentry->d_inode refs with file_inode().

In __fput(), use file->f_inode instead so as not to be affected by any tricks
that file_inode() might grow.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7b2c406 14-Jun-2013 Oleg Nesterov <oleg@redhat.com>

fput: task_work_add() can fail if the caller has passed exit_task_work()

fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.

Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from

if (PF_KTHREAD) {
schedule_work(...);
return;
}
task_work_add(...)

to
if (!PF_KTHREAD) {
if (!task_work_add(...))
return;
/* fallback */
}
schedule_work(...);

As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dd37978c 01-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

cache the value of file_inode() in struct file

Note that this thing does *not* contribute to inode refcount;
it's pinned down by dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 39b65252 12-Sep-2012 Anatol Pomozov <anatol.pomozov@gmail.com>

fs: Preserve error code in get_empty_filp(), part 2

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1afc99be 14-Feb-2013 Al Viro <viro@zeniv.linux.org.uk>

propagate error from get_empty_filp() to its callers

Based on parts from Anatol's patch (the rest is the next commit).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72651cac 05-Dec-2012 Jan Kara <jack@suse.cz>

fs: Fix imbalance in freeze protection in mark_files_ro()

File descriptors (even those for writing) do not hold freeze protection.
Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
against remount read-only. Calling mnt_drop_write_file() as we do now
results in:

[ BUG: bad unlock balance detected! ]
3.7.0-rc6-00028-g88e75b6 #101 Not tainted
-------------------------------------
kworker/1:2/79 is trying to release lock (sb_writers) at:
[<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
but there are no more locks to release!

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4b2c551f 09-Oct-2012 Lai Jiangshan <laijs@cn.fujitsu.com>

lglock: add DEFINE_STATIC_LGLOCK()

When the lglock doesn't need to be exported we can use
DEFINE_STATIC_LGLOCK().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0ee8cdfe 15-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

take fget() and friends to fs/file.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4199d35c 16-Mar-2011 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: move ima_file_free before releasing the file

ima_file_free(), called on __fput(), currently flags files that have
changed, so that the file is re-measured. For appraising a files's
integrity, the file's hash must be re-calculated and stored in the
'security.ima' xattr to reflect any changes.

This patch moves the ima_file_free() call to before releasing the file
in preparation of ima-appraisal measuring the file and updating the
'security.ima' xattr.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>


# eb04c282 12-Jun-2012 Jan Kara <jack@suse.cz>

fs: Add freezing handling to mnt_want_write() / mnt_drop_write()

Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5c33b183 20-Jul-2012 Al Viro <viro@zeniv.linux.org.uk>

uninline file_free_rcu()

What inline? Its only use is passing its address to call_rcu(), for fuck sake!

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4a9d4b02 23-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

switch fput to task_work_add

... and schedule_work() for interrupt/kernel_thread callers
(and yes, now it *is* OK to call from interrupt).

We are guaranteed that __fput() will be done before we return
to userland (or exit). Note that for fput() from a kernel
thread we get an async behaviour; it's almost always OK, but
sometimes you might need to have __fput() completed before
you do anything else. There are two mechanisms for that -
a general barrier (flush_delayed_fput()) and explicit
__fput_sync(). Both should be used with care (as was the
case for fput() from kernel threads all along). See comments
in fs/file_table.c for details.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 85d7d618 23-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

mark_files_ro(): don't bother with mntget/mntput

mnt_drop_write_file() is safe under any lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 962830df 07-May-2012 Andi Kleen <ak@linux.intel.com>

brlocks/lglocks: API cleanups

lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.

In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.

The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eea62f83 07-May-2012 Andi Kleen <ak@linux.intel.com>

brlocks/lglocks: turn into functions

lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.

Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.

This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b57ce969 12-Feb-2012 Al Viro <viro@zeniv.linux.org.uk>

vfs: drop_file_write_access() made static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8e8b8796 20-Nov-2011 Miklos Szeredi <mszeredi@suse.cz>

vfs: prevent remount read-only if pending removes

If there are any inodes on the super block that have been unlinked
(i_nlink == 0) but have not yet been deleted then prevent the
remounting the super block read-only.

Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 60063497 26-Jul-2011 Arun Sharma <asharma@fb.com>

atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60ed8cf7 16-Mar-2011 Miklos Szeredi <miklos@szeredi.hu>

fix cdev leak on O_PATH final fput()

__fput doesn't need a cdev_put() for O_PATH handles.

Signed-off-by: mszeredi@suse.cz
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 326be7b4 13-Mar-2011 Al Viro <viro@zeniv.linux.org.uk>

Allow passing O_PATH descriptors via SCM_RIGHTS datagrams

Just need to make sure that AF_UNIX garbage collector won't
confuse O_PATHed socket on filesystem for real AF_UNIX opened
socket.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1abf0c71 13-Mar-2011 Al Viro <viro@zeniv.linux.org.uk>

New kind of open files - "location only".

New flag for open(2) - O_PATH. Semantics:
* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
* almost all operations on the resulting descriptors shall
fail with -EBADF. Exceptions are:
1) operations on descriptors themselves (i.e.
close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
fcntl(fd, F_SETFD, ...))
2) fcntl(fd, F_GETFL), for a common non-destructive way to
check if descriptor is open
3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
points of pathname resolution
* closing such descriptor does *NOT* affect dnotify or
posix locks.
* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself. Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).

fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them. That protects
existing code from dealing with those things.

There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 890275b5 02-Nov-2010 Mimi Zohar <zohar@linux.vnet.ibm.com>

IMA: maintain i_readcount in the VFS layer

ima_counts_get() updated the readcount and invalidated the PCR,
as necessary. Only update the i_readcount in the VFS layer.
Move the PCR invalidation checks to ima_file_check(), where it
belongs.

Maintaining the i_readcount in the VFS layer, will allow other
subsystems to use i_readcount.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com>


# 78d29788 04-Feb-2011 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>

CRED: Fix kernel panic upon security_file_alloc() failure.

In get_empty_filp() since 2.6.29, file_free(f) is called with f->f_cred == NULL
when security_file_alloc() returned an error. As a result, kernel will panic()
due to put_cred(NULL) call within RCU callback.

Fix this bug by assigning f->f_cred before calling security_file_alloc().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3bc0ba43 13-Dec-2010 Steven Rostedt <srostedt@redhat.com>

fs: Remove unlikely() from fget_light()

There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.

correct incorrect % Function File Line
------- --------- - -------- ---- ----
1035099358 6209599193 85 fget_light file_table.c 315

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 518de9b3 26-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com>

fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:

atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7e360c38 05-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com>

fs: allow for more than 2^31 files

Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:

atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6416ccb7 17-Aug-2010 Nick Piggin <npiggin@kernel.dk>

fs: scale files_lock

fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

throughput
2.6.34-rc2 24.5
+patch 24.9

us sys idle IO wait (in %)
2.6.34-rc2 51.25 28.25 17.25 3.25
+patch 53.75 18.5 19 8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ee2ffa0d 17-Aug-2010 Nick Piggin <npiggin@kernel.dk>

fs: cleanup files_lock locking

fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2069601b 12-Aug-2010 Linus Torvalds <torvalds@linux-foundation.org>

Revert "fsnotify: store struct file not struct path"

This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the
accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay
the final work in fput" that was a horribly ugly hack to make it work at
all).

The 'struct file' approach not only causes that disgusting hack, it
somehow breaks pulseaudio, probably due to some other subtlety with
f_count handling.

Fix up various conflicts due to later fsnotify work.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58939473 10-Aug-2010 Tony Battersby <tonyb@cybernetics.com>

vfs: improve comment describing fget_light()

Improve the description of fget_light(), which is currently incorrect
about needing a prior refcnt (judging by the way it is actually used).

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c1e5c954 28-Jul-2010 Eric Paris <eparis@redhat.com>

vfs/fsnotify: fsnotify_close can delay the final work in fput

fanotify almost works like so:

user context calls fsnotify_* function with a struct file.
fsnotify takes a reference on the struct path
user context goes about it's buissiness

at some later point in time the fsnotify listener gets the struct path
fanotify listener calls dentry_open() to create a file which userspace can deal with
listener drops the reference on the struct path
at some later point the listener calls close() on it's new file

With the switch from struct path to struct file this presents a problem for
fput() and fsnotify_close(). fsnotify_close() is called when the filp has
already reached 0 and __fput() wants to do it's cleanup.

The solution presented here is a bit odd. If an event is created from a
struct file we take a reference on the file. We check however if the f_count
was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
In __fput() (where we know the f_count hit 0 once) we check if the f_count is
non-zero and if so we drop that 'extra' ref and return without destroying the
file.

Signed-off-by: Eric Paris <eparis@redhat.com>


# d7065da0 26-May-2010 Al Viro <viro@zeniv.linux.org.uk>

get rid of the magic around f_count in aio

__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 42e49608 05-Mar-2010 Wu Fengguang <fengguang.wu@intel.com>

vfs: take f_lock on modifying f_mode after open time

We'll introduce FMODE_RANDOM which will be runtime modified. So protect
all runtime modification to f_mode with f_lock to avoid races.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> [2.6.33.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 89068c57 07-Feb-2010 Al Viro <viro@zeniv.linux.org.uk>

Take ima_file_free() to proper place.

Hooks: Just Say No.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 385e3ed4 16-Dec-2009 Roland Dreier <rdreier@cisco.com>

alloc_file(): simplify handling of mnt_clone_write() errors

When alloc_file() and init_file() were combined, the error handling of
mnt_clone_write() was taken into alloc_file() in a somewhat obfuscated
way. Since we don't use the error code for anything except warning,
we might as well warn directly without an extra variable.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 73efc468 16-Dec-2009 Roland Dreier <rdreier@cisco.com>

re-export alloc_file()

Commit 3d1e4631 ("get rid of init_file()") removed the export of
alloc_file() -- possibly inadvertently, since that commit mainly
consisted of deleting the lines between the end of alloc_file() and
the start of the code in init_file().

There is in fact one modular use of alloc_file() in the tree, in
drivers/infiniband/core/uverbs_main.c, so re-add the export to fix:

ERROR: "alloc_file" [drivers/infiniband/core/ib_uverbs.ko] undefined!

when CONFIG_INFINIBAND_USER_ACCESS=m.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0552f879 16-Dec-2009 Al Viro <viro@zeniv.linux.org.uk>

Untangling ima mess, part 1: alloc_file()

There are 2 groups of alloc_file() callers:
* ones that are followed by ima_counts_get
* ones giving non-regular files
So let's pull that ima_counts_get() into alloc_file();
it's a no-op in case of non-regular files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e81e3f4d 04-Dec-2009 Eric Paris <eparis@redhat.com>

fs: move get_empty_filp() deffinition to internal.h

All users outside of fs/ of get_empty_filp() have been removed. This patch
moves the definition from the include/ directory to internal.h so no new
users crop up and removes the EXPORT_SYMBOL. I'd love to see open intents
stop using it too, but that's a problem for another day and a smarter
developer!

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2c48b9c4 08-Aug-2009 Al Viro <viro@zeniv.linux.org.uk>

switch alloc_file() to passing struct path

... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3d1e4631 08-Aug-2009 Al Viro <viro@zeniv.linux.org.uk>

get rid of init_file()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 73274127 05-Aug-2009 Al Viro <viro@zeniv.linux.org.uk>

unexport get_empty_filp()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6c21a7fb 22-Oct-2009 Mimi Zohar <zohar@linux.vnet.ibm.com>

LSM: imbed ima calls in the security hooks

Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 8d65af78 23-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 864d7c4c 26-Apr-2009 npiggin@suse.de <npiggin@suse.de>

fs: move mark_files_ro into file_table.c

This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 96029c4e 26-Apr-2009 npiggin@suse.de <npiggin@suse.de>

fs: introduce mnt_clone_write

This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
avg = 462.286
std = 5.46106

After:
avg = 453.12
std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a4e49cb6 08-Mar-2009 Tero Roponen <tero.roponen@gmail.com>

trivial: remove unused variable 'path' in alloc_file()

'struct path' is not used in alloc_file().

Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 68499914 06-Feb-2009 Jonathan Corbet <corbet@lwn.net>

Rename struct file->f_ep_lock

This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
epoll remains the only user, but a future patch will use it to protect
f_flags as well.

Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# 6146f0d5 04-Feb-2009 Mimi Zohar <zohar@linux.vnet.ibm.com>

integrity: IMA hooks

This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b6b3fdea 10-Dec-2008 Eric Dumazet <dada1@cosmosbay.com>

filp_cachep can be static in fs/file_table.c

Instead of creating the "filp" kmem_cache in vfs_caches_init(),
we can do it a litle be later in files_init(), so that filp_cachep
is static to fs/file_table.c

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d76b0d9b 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Use creds in file structs

Attach creds to file structs and discard f_uid/f_gid.

file_operations::open() methods (such as hppfs_open()) should use file->f_cred
rather than current_cred(). At the moment file->f_cred will be current_cred()
at this point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>


# 86a264ab 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap current->cred and a few other accessors

Wrap current->cred and a few other accessors to hide their actual
implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b6dff3ec 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>


# 233e70f4 31-Oct-2008 Al Viro <viro@ZenIV.linux.org.uk>

saner FASYNC handling on file close

As it is, all instances of ->release() for files that have ->fasync()
need to remember to evict file from fasync lists; forgetting that
creates a hole and we actually have a bunch that *does* forget.

So let's keep our lives simple - let __fput() check FASYNC in
file->f_flags and call ->fasync() there if it's been set. And lose that
crap in ->release() instances - leaving it there is still valid, but we
don't have to bother anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# aeb5d727 02-Sep-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] introduce fmode_t, do annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 516e0cc5 25-Jul-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] f_count may wrap around

make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9f3acc31 24-Apr-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] split linux/file.h

Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ad775f5a 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: debugging for missed calls

There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.

This patch provides a very simple debugging framework to catch these kinds of
bugs. It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.

I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.

[hch: made it conditional on a debug option.
But it's still a little bit too ugly]

[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4a3fd211 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: elevate write count for open()s

This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.

We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.

There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aceaf78d 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] r/o bind mounts: create helper to drop file write access

If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

AV: drop write access in __fput() only after we evict from file list.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 430e285e 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com>

[PATCH] fix up new filp allocators

Some new uses of get_empty_filp() have crept in; switched
to alloc_file() to make sure that pieces of initialization
won't be missing.

We really need to kill get_empty_filp().

[AV] fixed dentry leak on failure exit in anon_inode_getfd()

Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fc9b52cd 08-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com>

fs: remove fastcall, it is always empty

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cfdaf9e5 19-Oct-2007 Matthias Kaehlcke <matthias.kaehlcke@gmail.com>

fs/file_table.c: use list_for_each_entry() instead of list_for_each()

fs/file_table.c: use list_for_each_entry() instead of list_for_each()
in fs_may_remount_ro()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ce8d2cdf 17-Oct-2007 Dave Hansen <haveblue@us.ibm.com>

r/o bind mounts: filesystem helpers for custom 'struct file's

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4975e45f 17-Oct-2007 Denis Cheng <crquan@gmail.com>

fs: use kmem_cache_zalloc instead

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52d9f3b4 17-Oct-2007 Peter Zijlstra <a.p.zijlstra@chello.nl>

lib: percpu_counter_sum_positive

s/percpu_counter_sum/&_positive/

Because its consitent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e63340ae 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com>

header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f7fc9e4 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] VFS: change struct file to use struct path

This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 609d7fa9 02-Oct-2006 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] file: modify struct fown_struct to use a struct pid

File handles can be requested to send sigio and sigurg to processes. By
tracking the destination processes using struct pid instead of pid_t we make
the interface safe from all potential pid wrap around problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 577c4eb0 27-Sep-2006 Theodore Ts'o <tytso@mit.edu>

[PATCH] inode-diet: Move i_cdev into a union

Move the i_cdev pointer in struct inode into a union.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 0216bfcf 23-Jun-2006 Mingming Cao <cmm@us.ibm.com>

[PATCH] percpu counter data type changes to suppport more than 2**31 ext3 free blocks counter

The percpu counter data type are changed in this set of patches to support
more users like ext3 who need more than 32 bit to store the free blocks
total in the filesystem.

- Generic perpcu counters data type changes. The size of the global counter
and local counter were explictly specified using s64 and s32. The global
counter is changed from long to s64, while the local counter is changed from
long to s32, so we could avoid doing 64 bit update in most cases.

- Users of the percpu counters are updated to make use of the new
percpu_counter_init() routine now taking an additional parameter to allow
users to pass the initial value of the global counter.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5a6b7951 23-Mar-2006 Benjamin LaHaise <bcrl@linux.intel.com>

[PATCH] get_empty_filp tweaks, inline epoll_init_file()

Eliminate a handful of cache references by keeping current in a register
instead of reloading (helps x86) and avoiding the overhead of a function
call. Inlining eventpoll_init_file() saves 24 bytes. Also reorder file
initialization to make writes occur more sequentially.

Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 529bf6be 07-Mar-2006 Dipankar Sarma <dipankar@in.ibm.com>

[PATCH] fix file counting

I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.

The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844

At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.

Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 16f7e0fe 11-Jan-2006 Randy Dunlap <rdunlap@infradead.org>

[PATCH] capable/capability.h (fs/)

fs: Use <linux/capability.h> where capable() is used.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 095975da 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] rcu file: use atomic primitives

Use atomic_inc_not_zero for rcu files instead of special case rcuref.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2109a2d1 07-Nov-2005 Pekka J Enberg <penberg@cs.Helsinki.FI>

[PATCH] mm: rename kmem_cache_s to kmem_cache

This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2f512016 30-Oct-2005 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] reduce sizeof(struct file)

Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead
in a memory location that is not anymore used at call_rcu(&f->f_rcuhead,
file_free_rcu) time, to reduce the size of this critical kernel object.

The trick I used is to move f_rcuhead and f_list in an union called f_u

The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list
becomes f_u.f_list

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ab2af1f5 09-Sep-2005 Dipankar Sarma <dipankar@in.ibm.com>

[PATCH] files: files struct with RCU

Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure. The fdtable structure is
protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead. This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2832e936 06-Sep-2005 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] remove file.f_maxcount

struct file cleanup: f_maxcount has an unique value (INT_MAX). Just use
the hard-wired value.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0eeca283 12-Jul-2005 Robert Love <rml@novell.com>

[PATCH] inotify

inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:

* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?

inotify provides a more usable, simple, powerful solution to file change
notification:

* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.

Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.

See Documentation/filesystems/inotify.txt.

Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# af4d2ecb 23-Jun-2005 Kirill Korotaev <dev@sw.ru>

[PATCH] Fix of bogus file max limit messages

This patch fixes incorrect and bogus kernel messages that file-max limit
reached when the allocation fails

Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!