History log of /linux-master/fs/gfs2/inode.c
Revision Date Author Comments
# e9f1e6bb 02-Feb-2024 Andreas Gruenbacher <agruenba@redhat.com>

Revert "gfs2: Use GL_NOBLOCK flag for non-blocking lookups"

Commit "gfs2: Use GL_NOBLOCK flag for non-blocking lookups" has several
issues, some of which are non-trivial to fix, so revert it for now:

https://lore.kernel.org/gfs2/20240202050230.GA875515@ZenIV/T/

This reverts commit dd00aaeb343255a8a30de671bd27bde79a47c8e5.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# dd00aaeb 10-Nov-2023 Abhi Das <adas@redhat.com>

gfs2: Use GL_NOBLOCK flag for non-blocking lookups

Add the GL_NOBLOCK flag to the locking requests in gfs2_permission() and
gfs2_drevalidate() when called with the MAY_NOT_BLOCK flag and
LOOKUP_RCU flag, respectively. This will cause the locking requests to
be handled without sleeping if possible. We bail out with -ECHILD if we
can't grant the glock immediately.

Make sure not to dget() + dput() the parent dentry in gfs2_drevalidate()
in LOOKUP_RCU mode; dput() is a sleeping operation.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 074d7306 30-Oct-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Silence "suspicious RCU usage in gfs2_permission" warning

Commit 0abd1557e21c added rcu_dereference() for dereferencing ip->i_gl
in gfs2_permission. This now causes lockdep to complain when
gfs2_permission is called in non-RCU context:

WARNING: suspicious RCU usage in gfs2_permission

Switch to rcu_dereference_check() and check for the MAY_NOT_BLOCK flag
to shut up lockdep when we know that dereferencing ip->i_gl is safe.

Fixes: 0abd1557e21c ("gfs2: fix an oops in gfs2_permission")
Reported-by: syzbot+3e5130844b0c0e2b4948@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 062fb903 26-Jul-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Rename gfs2_lookup_{ simple => meta }

Function gfs2_lookup_simple() is used for looking up inodes in the
metadata directory tree, so rename it to gfs2_lookup_meta() to closer
match its purpose. Clean the function up a little on the way.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 4c7b3f7f 20-Oct-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of gfs2_alloc_blocks generation parameter

Get rid of the generation parameter of gfs2_alloc_blocks(): we only ever
set the generation of the current inode while creating it, so do so
directly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 2d8d7990 21-Oct-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: setattr_chown: Add missing initialization

Add a missing initialization of variable ap in setattr_chown().
Without, chown() may be able to bypass quotas.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 0abd1557 01-Oct-2023 Al Viro <viro@zeniv.linux.org.uk>

gfs2: fix an oops in gfs2_permission

In RCU mode, we might race with gfs2_evict_inode(), which zeroes
->i_gl. Freeing of the object it points to is RCU-delayed, so
if we manage to fetch the pointer before it's been replaced with
NULL, we are fine. Check if we'd fetched NULL and treat that
as "bail out and tell the caller to get out of RCU mode".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 21d9067e 20-Sep-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of the gfs2_glock_is_held_* helpers

Those helpers don't add any clarity and are easy to use wrong. Spell
them out to make more obvious what's happening.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 580f721b 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

gfs2: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-38-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 111c7d27 26-Jul-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Use mapping->gfp_mask for metadata inodes

Set mapping->gfp mask to GFP_NOFS for all metadata inodes so that
allocating pages in the address space of those inodes won't call back
into the filesystem. This allows to switch back from
find_or_create_page() to grab_cache_page() in two places.

Partially reverts commit 220cca2a4f58 ("GFS2: Change truncate page
allocation to be GFP_NOFS").

Thanks to Dan Carpenter <dan.carpenter@linaro.org> for pointing out a
Smatch static checker warning.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 913e9928 07-Aug-2023 Jeff Layton <jlayton@kernel.org>

fs: drop the timespec64 argument from update_time

Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 541d4c79 07-Aug-2023 Jeff Layton <jlayton@kernel.org>

fs: drop the timespec64 arg from generic_update_time

In future patches we're going to change how the ctime is updated
to keep track of when it has been queried. The way that the update_time
operation works (and a lot of its callers) make this difficult, since
they grab a timestamp early and then pass it down to eventually be
copied into the inode.

All of the existing update_time callers pass in the result of
current_time() in some fashion. Drop the "time" parameter from
generic_update_time, and rework it to fetch its own timestamp.

This change means that an update_time could fetch a different timestamp
than was seen in inode_needs_update_time. update_time is only ever
called with one of two flag combinations: Either S_ATIME is set, or
S_MTIME|S_CTIME|S_VERSION are set.

With this change we now treat the flags argument as an indicator that
some value needed to be updated when last checked, rather than an
indication to update specific timestamps.

Rework the logic for updating the timestamps and put it in a new
inode_update_timestamps helper that other update_time routines can use.
S_ATIME is as treated as we always have, but if any of the other three
are set, then we attempt to update all three.

Also, some callers of generic_update_time need to know what timestamps
were actually updated. Change it to return an S_* flag mask to indicate
that and rework the callers to expect it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0d72b928 07-Aug-2023 Jeff Layton <jlayton@kernel.org>

fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 8a8b8d91 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

gfs2: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-45-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 14a58517 14-Mar-2023 Andrew Price <anprice@redhat.com>

gfs2: Remove ghs[] from gfs2_unlink

Replace the 3-item array with three variables for readability.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 2d084780 14-Mar-2023 Andrew Price <anprice@redhat.com>

gfs2: Remove ghs[] from gfs2_link

Replace the 2-item array with two variables for readability.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 8dc14966 14-Mar-2023 Andrew Price <anprice@redhat.com>

gfs2: Remove duplicate i_nlink check from gfs2_link()

The duplication is:

struct gfs2_inode *ip = GFS2_I(inode);
[...]
error = -ENOENT;
if (inode->i_nlink == 0)
goto out_gunlock;
[...]
error = -EINVAL;
if (!ip->i_inode.i_nlink)
goto out_gunlock;

The second check is removed. ENOENT is the correct error code for
attempts to link a deleted inode (ref: link(2)).

If we support O_TMPFILE in future the check will need to be updated with
an exception for inodes flagged I_LINKABLE so sorting out this
duplication now will make it a slightly cleaner change.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 9ffa1888 23-Jan-2023 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gl_object races fix

Function glock_clear_object() checks if the specified glock is still
pointing at the right object and clears the gl_object pointer. To
handle the case of incompletely constructed inodes, glock_clear_object()
also allows gl_object to be NULL.

However, in the teardown case, when iget_failed() is called and the
inode is removed from the inode hash, by the time we get to the
glock_clear_object() calls in gfs2_put_super() and its helpers, we don't
have exclusion against concurrent gfs2_inode_lookup() and
gfs2_create_inode() calls, and the inode and iopen glocks may already be
pointing at another inode, so the checks in glock_clear_object() are
incorrect.

To better handle this case, always completely disassociate an inode from
its glocks before tearing it down. In addition, get rid of a duplicate
glock_clear_object() call in gfs2_evict_inode(). That way,
glock_clear_object() will only ever be called when the glock points at
the current inode, and the NULL check in glock_clear_object() can be
removed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 13e83a49 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->set_acl() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# e18275ae 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->rename() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 5ebb29be 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mknod() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c54bd91e 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->mkdir() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 7a77db95 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->symlink() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 6c960e68 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->create() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# b74d24f7 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->getattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1632a0f 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->setattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 2ec750a0 03-Dec-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Add gfs2_inode_lookup comment

Add comment on when and why gfs2_cancel_delete_work() needs to be
skipped in gfs2_inode_lookup().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 4ec3c19d 05-Nov-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Handle -EBUSY result of insert_inode_locked4

When creating a new inode, there is a small chance that an inode lookup
for a previous version of the same inode is still in progress. In that
case, that previous lookup will eventually fail, but we may still need
to retry here.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 38552ff6 02-Nov-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Fix and clean up create / evict interaction

When gfs2_create_inode() fails after creating a new inode, it uses the
GIF_FREE_VFS_INODE and GIF_ALLOC_FAILED inode flags to communicate to
gfs2_evict_inode() which parts of the inode need to be deallocated and
destroyed. In some error cases, the inode ends up being allocated on
disk and then accidentally left behind. In others, the inode is
partially constructed and then not properly destroyed. Clean this up by
completely handling the inode deallocation and destruction in
gfs2_evict_inode().

This means that gfs2_evict_inode() may now be faced with partially
constructed inodes, so add the necessary checks to cope with that. In
particular, make sure that for incompletely constructed inodes, we're
not accessing the buffers backing the on-disk blocks; the contents may
be undefined.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 3d0258bc 02-Nov-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Clean up initialization of "ip" in gfs2_create_inode

Initialize variable "ip" earlier so that it can be used interchangeably
with "inode" everywhere.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 761fdbbc 04-Nov-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of ghs[] in gfs2_create_inode

In gfs2_create_inode, get rid of the ghs array in favor of two separate
variables. This makes the code much less irritating.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 35c23fba 02-Nov-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Add extra error check in alloc_dinode

We have reserved the number of blocks we want to allocate, so the actual
allocation isn't expected to fail. Nevertheless, make the code behave
correctly even when things go wrong.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# cac2f8b8 22-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: rename current get acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

acl_permission_check()
-> check_acl()
-> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 138060ba 23-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: pass dentry to set acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystem rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
operation. But since ->set_acl() is required in order to use the generic
posix acl xattr handlers filesystems that do not implement this inode
operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c412a97c 22-Aug-2022 Bob Peterson <rpeterso@redhat.com>

gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes

Before this patch, delete_work_func() would check for the GLF_DEMOTE
flag on the iopen glock and if set, it would perform special processing.
However, there was a race whereby the GLF_DEMOTE flag could be set by
another process after the check. Then when it called
gfs2_lookup_by_inum() which calls gfs2_inode_lookup(), it tried to lock
the iopen glock in SH mode, but the GLF_DEMOTE flag prevented the
request from being granted. But the iopen glock could never be demoted
because that happens when the inode is evicted, and the evict was never
completed because of the failed lookup.

To fix that, change function gfs2_inode_lookup() so that when
GFS2_BLKST_UNLINKED inodes are searched, it uses the LM_FLAG_TRY flag
for the iopen glock. If the locking request fails, fail
gfs2_inode_lookup() with -EAGAIN so that delete_work_func() can retry
the operation later.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# ebdc416c 05-Apr-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Mark the remaining process-independent glock holders as GL_NOPID

Add the GL_NOPID flag for the remaining glock holders which are not
associated with the current process.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 29464ee3 23-Jan-2022 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Switch lock order of inode and iopen glock

This patch tries to fix the continual ABBA deadlocks we keep having
between the iopen and inode glocks. This switches the lock order in
gfs2_inode_lookup and gfs2_create_inode so the iopen glock is always
locked first.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 7336905a 10-Dec-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gfs2_setattr_size error path fix

When gfs2_setattr_size() fails, it calls gfs2_rs_delete(ip, NULL) to get
rid of any reservations the inode may have. Instead, it should pass in
the inode's write count as the second parameter to allow
gfs2_rs_delete() to figure out if the inode has any writers left.

In a next step, there are two instances of gfs2_rs_delete(ip, NULL) left
where we know that there can be no other users of the inode. Replace
those with gfs2_rs_deltree(&ip->i_res) to avoid the unnecessary write
count check.

With that, gfs2_rs_delete() is only called with the inode's actual write
count, so get rid of the second parameter.

Fixes: a097dc7e24cb ("GFS2: Make rgrp reservations part of the gfs2_inode structure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 3d36e57f 30-Nov-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gfs2_create_inode rework

When gfs2_lookup_by_inum() calls gfs2_inode_lookup() for an uncached
inode, gfs2_inode_lookup() will place a new tentative inode into the
inode cache before verifying that there is a valid inode at the given
address. This can race with gfs2_create_inode() which doesn't check for
duplicates inodes. gfs2_create_inode() will try to assign the new inode
to the corresponding inode glock, and glock_set_object() will complain
that the glock is still in use by gfs2_inode_lookup's tentative inode.

We noticed this bug after adding commit 486408d690e1 ("gfs2: Cancel
remote delete work asynchronously") which allowed delete_work_func() to
race with gfs2_create_inode(), but the same race exists for
open-by-handle.

Fix that by switching from insert_inode_hash() to
insert_inode_locked4(), which does check for duplicate inodes. We know
we've just managed to to allocate the new inode, so an inode tentatively
created by gfs2_inode_lookup() will eventually go away and
insert_inode_locked4() will always succeed.

In addition, don't flush the inode glock work anymore (this can now only
make things worse) and clean up glock_{set,clear}_object for the inode
glock somewhat.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 5f6e13ba 29-Nov-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gfs2_inode_lookup rework

Rework gfs2_inode_lookup() to only set up the new inode's glocks after
verifying that the new inode is valid.

There is no need for flushing the inode glock work queue anymore now,
so remove that as well.

While at it, get rid of the useless wrapper around iget5_locked() and
its unnecessary is_bad_inode() check.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# b8e12e35 28-Nov-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gfs2_inode_lookup cleanup

In gfs2_inode_lookup, once the inode has been looked up, we check if the
inode generation (no_formal_ino) is the one we're looking for. If it
isn't and the inode wasn't in the inode cache, we discard the newly
looked up inode. This is unnecessary, complicates the code, and makes
future changes to gfs2_inode_lookup harder, so change the code to retain
newly looked up inodes instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 8793e149 01-Oct-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: set glock object after nq

Before this patch, function gfs2_create_inode called glock_set_object to
set the gl_object for inode and iopen glocks before the glock was locked.
That's wrong because other competing processes like evict may be
blocked waiting for the glock and still have gl_object set before the
actual eviction can take place.

This patch moves the call to glock_set_object until after the glock is
acquire in function gfs2_create_inode, so it waits for possibly
competing evicts to finish their processing first.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# ec1d398d 05-Oct-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Eliminate GIF_INVALID flag

With the addition of the new GLF_INSTANTIATE_NEEDED flag, the
GIF_INVALID flag is now redundant. This patch removes it.
Since inode_instantiate is only called when instantiation is needed,
the check in inode_instantiate is removed too.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# f2e70d8f 06-Oct-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: fix GL_SKIP node_scope problems

Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.

But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.

This patch tries to fix the problem by introducing two new glock flags:

GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.

GLF_INSTANTIATE_IN_PROG which is used to determine when a process is
in the process of reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.

As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.

Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 763766c0 29-Sep-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: dequeue iopen holder in gfs2_inode_lookup error

Before this patch, if function gfs2_inode_lookup encountered an error
after it had locked the iopen glock, it never unlocked it, relying on
the evict code to do the cleanup. The evict code then took the
inode glock while holding the iopen glock, which violates the locking
order. For example,

(1) node A does a gfs2_inode_lookup that fails, leaving the iopen glock
locked.

(2) node B calls delete_work_func -> gfs2_lookup_by_inum ->
gfs2_inode_lookup. It locks the inode glock and blocks trying to
lock the iopen glock, which is held by node A.

(3) node A eventually calls gfs2_evict_inode -> evict_should_delete.
It blocks trying to lock the inode glock, which is now held by
node B.

This patch introduces error handling to function gfs2_inode_lookup
so it properly dequeues held iopen glocks on errors.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# d75b9fa0 28-Jul-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Switch to may_setattr in gfs2_setattr

The permission check in gfs2_setattr is an old and outdated version of
may_setattr(). Switch to the updated version.

Fixes fstest generic/079.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e5966cf2 19-Nov-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

gfs2: Fix fall-through warnings for Clang

In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple goto statements instead of just
letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# c551f66c 30-Mar-2021 Lee Jones <lee.jones@linaro.org>

gfs2: Fix a number of kernel-doc warnings

Building the kernel with W=1 results in a number of kernel-doc warnings
like incorrect function names and parameter descriptions. Fix those,
mostly by adding missing parameter descriptions, removing left-over
descriptions, and demoting some less important kernel-doc comments into
regular comments.

Originally proposed by Lee Jones; improved and combined into a single
patch by Andreas.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# a4122a95 06-Apr-2021 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Make gfs2_setattr_simple static

This function is only used in inode.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 88b631cb 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

gfs2: convert to fileattr

Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>


# 6f24784f 31-Jan-2021 Al Viro <viro@zeniv.linux.org.uk>

whack-a-mole: don't open-code iminor/imajor

several instances creeped back into the tree...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4fc7ec31 24-Apr-2018 Bob Peterson <rpeterso@redhat.com>

gfs2: Use resource group glock sharing

This patch takes advantage of the new glock holder sharing feature for
resource groups. We have already introduced local resource group
locking in a previous patch, so competing accesses of local processes
are already under control.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 0d56a451 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

stat: handle idmapped mounts

The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# e65ce2a5 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

acl: handle idmapped mounts

The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 2f221d6f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 47291baa 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: make permission helpers idmapped mount aware

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# a55a47a3 27-Nov-2020 Andreas Gruenbacher <agruenba@redhat.com>

Revert "GFS2: Prevent delete work from occurring on glocks used for create"

Since commit a0e3cc65fa29 ("gfs2: Turn gl_delete into a delayed work"), we're
cancelling any pending delete work of an iopen glock before attaching a new
inode to that glock in gfs2_create_inode. This means that delete_work_func can
no longer be queued or running when attaching the iopen glock to the new inode,
and we can revert commit a4923865ea07 ("GFS2: Prevent delete work from
occurring on glocks used for create"), which tried to achieve the same but in a
racy way.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# e3a77eeb 25-Nov-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Make inode operations static

The inode operations are not used outside inode.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# dd0ecf54 30-Nov-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func

In gfs2_create_inode and gfs2_inode_lookup, make sure to cancel any pending
delete work before taking the inode glock. Otherwise, gfs2_cancel_delete_work
may block waiting for delete_work_func to complete, and delete_work_func may
block trying to acquire the inode glock in gfs2_inode_lookup.

Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa29 ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 82e938bd 25-Nov-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Upgrade shared glocks for atime updates

Commit 20f829999c38 ("gfs2: Rework read and page fault locking") lifted
the glock lock taking from the low-level ->readpage and ->readahead
address space operations to the higher-level ->read_iter file and
->fault vm operations. The glocks are still taken in LM_ST_SHARED mode
only. On filesystems mounted without the noatime option, ->read_iter
sometimes needs to update the atime as well, though. Right now, this
leads to a failed locking mode assertion in gfs2_dirty_inode.

Fix that by introducing a new update_time inode operation. There, if
the glock is held non-exclusively, upgrade it to an exclusive lock.

Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: 20f829999c38 ("gfs2: Rework read and page fault locking")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 6bd1c7bd 02-Nov-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Don't call cancel_delayed_work_sync from within delete work function

Right now, we can end up calling cancel_delayed_work_sync from within
delete_work_func via gfs2_lookup_by_inum -> gfs2_inode_lookup ->
gfs2_cancel_delete_work. When that happens, it will result in a
deadlock. Instead, gfs2_inode_lookup should skip the call to
gfs2_cancel_delete_work when called from delete_work_func (blktype ==
GFS2_BLKST_UNLINKED).

Reported-by: Alexander Ahring Oder Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa29 ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 5902f4dd 09-Jun-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Don't return NULL from gfs2_inode_lookup

Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error).
Commit b66648ad6dcf caused it to return NULL instead of ERR_PTR(-ESTALE) in
some cases. Fix that.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: b66648ad6dcf ("gfs2: Move inode generation number check into gfs2_inode_lookup")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# b66648ad 14-Jan-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Move inode generation number check into gfs2_inode_lookup

Move the inode generation number check from gfs2_lookup_by_inum into
gfs2_inode_lookup: gfs2_inode_lookup may be able to decide that an inode with
the given inode generation number cannot exist without having to verify the
block type or reading the inode from disk.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 6bdcadea 14-Jan-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Minor gfs2_lookup_by_inum cleanup

Use a zero no_formal_ino instead of a NULL pointer to indicate that any inode
generation number will qualify: a valid inode never has a zero no_formal_ino.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# a0e3cc65 16-Jan-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Turn gl_delete into a delayed work

This requires flushing delayed work items in gfs2_make_fs_ro (which is called
before unmounting a filesystem).

When inodes are deleted and then recreated, pending gl_delete work items would
have no effect because the inode generations will have changed, so we can
cancel any pending gl_delete works before reusing iopen glocks.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 10c5db28 23-May-2020 Christoph Hellwig <hch@lst.de>

fs: move the fiemap definitions out of fs.h

No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# 1a0b00d1 01-Jun-2020 Bob Peterson <rpeterso@redhat.com>

gfs2: Only do glock put in gfs2_create_inode for free inodes

Before this patch, the error path of function gfs2_create_inode would
always calls gfs2_glock_put for the inode glock. That's good for inodes
that are free. But after they've been added to the vfs inodes, errors
will cause the inode to be evicted, and the evict will do the glock
put for us. If we do a glock put again, we can try to free the glock
while there are still references to it, e.g. revokes pending for
the transaction that created it.

This patch adds a check: if (free_vfs_inode) before the put, thus
solving the problem.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 2297ab61 04-May-2020 Bob Peterson <rpeterso@redhat.com>

gfs2: Fix problems regarding gfs2_qa_get and _put

This patch fixes a couple of places in which gfs2_qa_get and gfs2_qa_put are
not balanced: we now keep references around whenever a file is open for writing
(see gfs2_open_common and gfs2_release), so we need to put all references we
grab in function gfs2_create_inode. This was broken in the successful case and
on one error path.

This also means that we don't have a reference to put in gfs2_evict_inode.

In addition, gfs2_qa_put was called for the wrong inode in gfs2_link.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 1595548f 06-Mar-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put

Keeping reservations and quotas separate helps reviewing the code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 2fba46a0 26-Feb-2020 Bob Peterson <rpeterso@redhat.com>

gfs2: Change inode qa_data to allow multiple users

Before this patch, multiple users called gfs2_qa_alloc which allocated
a qadata structure to the inode, if quotas are turned on. Later, in
file close or evict, the structure was deleted with gfs2_qa_delete.
But there can be several competing processes who need access to the
structure. There were races between file close (release) and the others.
Thus, a release could delete the structure out from under a process
that relied upon its existence. For example, chown.

This patch changes the management of the qadata structures to be
a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get
and if the structure is allocated, the count essentially starts out at
1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the
last guy to decrement the count to 0 frees the memory.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# d580712a 06-Mar-2020 Bob Peterson <rpeterso@redhat.com>

gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc

Before this patch, multiple callers called gfs2_rsqa_alloc to force
the existence of a reservations structure and a quota data structure
if needed. However, now the reservations are handled separately, so
the quota data is only the quota data. So we eliminate the one in
favor of just calling gfs2_qa_alloc directly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 40e7e86e 24-Jan-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Clean up inode initialization and teardown

When allocating a new inode, mark the iopen glock holder as uninitialized to
make sure gfs2_evict_inode won't fail after an incomplete create or lookup. In
gfs2_evict_inode, allow the inode glock to be NULL and remove the duplicate
iopen glock teardown code. In gfs2_inode_lookup, don't tear down things that
gfs2_evict_inode will already tear down.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 21039132 10-Mar-2020 Al Viro <viro@zeniv.linux.org.uk>

gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache

with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves. gfs2 one didn't.

Fixes: 6d4ade986f9c (GFS2: Add atomic_open support)
Cc: stable@kernel.org # v3.11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2b0fb353 14-Jan-2020 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Avoid access time thrashing in gfs2_inode_lookup

In gfs2_inode_lookup, we initialize inode->i_atime to the lowest
possibly value after gfs2_inode_refresh may already have been called.
This should be the other way around, but we didn't notice because
usually the inode type is known from the directory entry and so
gfs2_inode_lookup won't call gfs2_inode_refresh.

In addition, only initialize ip->i_no_formal_ino from no_formal_ino when
actually needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 8f81180a 21-Nov-2019 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Remove duplicate call from gfs2_create_inode

In gfs2_create_inode, gfs2_set_inode_blocks is called twice for no good reason.
Remove the unnecessary call.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 2c47c1be 19-Nov-2019 Bob Peterson <rpeterso@redhat.com>

gfs2: clean up iopen glock mess in gfs2_create_inode

Before this patch, gfs2_create_inode had a use-after-free for the
iopen glock in some error paths because it did this:

gfs2_glock_put(io_gl);
fail_gunlock2:
if (io_gl)
clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);

In some cases, the io_gl was used for create and only had one
reference, so the glock might be freed before the clear_bit().
This patch tries to straighten it out by only jumping to the
error paths where iopen is properly set, and moving the
gfs2_glock_put after the clear_bit.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 098b9c14 04-Oct-2019 Aliasgar Surti <aliasgar.surti500@gmail.com>

gfs2: removed unnecessary semicolon

There is use of unnecessary semicolon after switch case.
Removed the semicolon.

Signed-off-by: Aliasgar Surti <aliasgar.surti500@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# ad26967b 29-Aug-2019 Bob Peterson <rpeterso@redhat.com>

gfs2: Use async glocks for rename

Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new" for
the purposes of rename. This can cause deadlocks where two nodes end up
waiting for each other.

There can be several layers of directory dependencies across many nodes.

This patch fixes the problem by acquiring all gfs2_rename's inode glocks
asychronously and waiting for all glocks to be acquired. That way all
inodes are locked regardless of the order.

The timeout value for multiple asynchronous glocks is calculated to be
the total of the individual wait times for each glock times two.

Since gfs2_exchange is very similar to gfs2_rename, both functions are
patched in the same way.

A new async glock wait queue, sd_async_glock_wait, keeps a list of
waiters for these events. If gfs2's holder_wake function detects an
async holder, it wakes up any waiters for the event. The waiter only
tests whether any of its requests are still pending.

Since the glocks are sent to dlm asychronously, the wait function needs
to check to see which glocks, if any, were granted.

If a glock is granted by dlm (and therefore held), its minimum hold time
is checked and adjusted as necessary, as other glock grants do.

If the event times out, all glocks held thus far must be dequeued to
resolve any existing deadlocks. Then, if there are any outstanding
locking requests, we need to loop around and wait for dlm to respond to
those requests too. After we release all requests, we return -ESTALE to
the caller (vfs rename) which loops around and retries the request.

Node1 Node2
--------- ---------
1. Enqueue A Enqueue B
2. Enqueue B Enqueue A
3. A granted
6. B granted
7. Wait for B
8. Wait for A
9. A times out (since Node 1 holds A)
10. Dequeue B (since it was granted)
11. Wait for all requests from DLM
12. B Granted (since Node2 released it in step 10)
13. Rename
14. Dequeue A
15. DLM Grants A
16. Dequeue A (due to the timeout and since we
no longer have B held for our task).
17. Dequeue B
18. Return -ESTALE to vfs
19. VFS retries the operation, goto step 1.

This release-all-locks / acquire-all-locks may slow rename / exchange
down as both nodes struggle in the same way and do the same thing.
However, this will only happen when there is contention for the same
inodes, which ought to be rare.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# bc74aaef 29-Aug-2019 Bob Peterson <rpeterso@redhat.com>

gfs2: separate holder for rgrps in gfs2_rename

Before this patch, gfs2_rename added a holder for the rgrp glock to
its array of holders, ghs. There's nothing wrong with that, but this
patch separates it into a separate holder. This is done to ensure
it's always locked last as per the proper glock lock ordering,
and also to pave the way for a future patch in which we will
lock the non-rgrp glocks asynchronously.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 15a798f7 05-Jun-2019 Kefeng Wang <wangkefeng.wang@huawei.com>

gfs2: Use IS_ERR_OR_NULL

Use IS_ERR_OR_NULL where appropriate.

(Several more places converted by Andreas.)

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 7336d0e6 31-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 398

Based on 1 normalized pattern(s):

this copyrighted material is made available to anyone wishing to use
modify copy or redistribute it subject to the terms and conditions
of the gnu general public license version 2

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 44 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081038.653000175@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 6ff9b09e 26-Nov-2018 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of potential double-freeing in gfs2_create_inode

In gfs2_create_inode, after setting and releasing the acl / default_acl, the
acl / default_acl pointers are not set to NULL as they should be. In that
state, when the function reaches label fail_free_acls, gfs2_create_inode will
try to release the same acls again.

Fix that by setting the pointers to NULL after releasing the acls. Slightly
simplify the logic. Also, posix_acl_release checks for NULL already, so
there is no need to duplicate those checks here.

Fixes: e01580bf9e4d ("gfs2: use generic posix ACL infrastructure")
Reported-by: Pan Bian <bianpan2016@163.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 44907d79 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

get rid of 'opened' argument of ->atomic_open() - part 3

now it can be done...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b452a458 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

getting rid of 'opened' argument of ->atomic_open() - part 2

__gfs2_lookup(), gfs2_create_inode(), nfs_finish_open() and fuse_create_open()
don't need 'opened' anymore. Get rid of that argument in those.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be12af3e 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

getting rid of 'opened' argument of ->atomic_open() - part 1

'opened' argument of finish_open() is unused. Kill it.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>


# 73a09dd9 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

introduce FMODE_CREATED and switch to it

Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...

Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aad888f8 07-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

switch all remaining checks for FILE_OPENED to FMODE_OPENED

... and don't bother with setting FILE_OPENED at all.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 628e366d 04-Jun-2018 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Iomap cleanups and improvements

Clean up gfs2_iomap_alloc and gfs2_iomap_get. Document how
gfs2_iomap_alloc works: it now needs to be called separately after
gfs2_iomap_get where necessary; this will be used later by iomap write.
Move gfs2_iomap_ops into bmap.c.

Introduce a new gfs2_iomap_get_alloc helper and use it in
fallocate_chunk: gfs2_iomap_begin will become unsuitable for fallocate
with proper iomap write support.

In gfs2_block_map and fallocate_chunk, zero-initialize struct iomap.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 83998ccd 28-Feb-2018 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Dirty source inode during rename

Mark the source inode dirty during a rename instead of just updating the
underlying buffer head. Otherwise, fsync may find the inode clean and
will then skip flushing the journal. A subsequent power failure will
cause the rename to be lost. This happens in command sequences like:

xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo
mv foo bar
xfs_io -c 'fsync' bar
# power failure

Fixes xfstests generic/322, generic/376.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 2eb5909d 29-Jan-2018 Bob Peterson <rpeterso@redhat.com>

GFS2: Don't try to end a non-existent transaction in unlink

Before this patch, if function gfs2_unlink failed to get a valid
transaction (for example, not enough journal blocks) it would go
to label out_end_trans which did gfs2_trans_end. But if the
trans_begin failed, there's no transaction to end, and trying to
do so results in: kernel BUG at fs/gfs2/trans.c:117!

This patch changes the goto so that it does not try to end a
non-existent transaction.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 235628c5 14-Nov-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Add gfs2_max_stuffed_size

Add a small inline function for computing the maximum size of a stuffed
inode instead of open coding that in several places throughout the code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# b2623c2f 09-Oct-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Add support for statx inode flags

Add support for the STATX_ATTR_ flags in statx. (Compression,
encryption, and the nodump flag are not supported by gfs2.)

Partially fixes xfstest generic/424.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 3a27411c 15-Mar-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Implement SEEK_HOLE / SEEK_DATA via iomap

So far, lseek on gfs2 did not report holes.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# aac1a55b 16-Feb-2017 Bob Peterson <rpeterso@redhat.com>

GFS2: Switch fiemap implementation to use iomap

This patch switches GFS2's implementation of fiemap from the old
block_map code to the new iomap interface.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>


# 38eedf28 22-Sep-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Support negative atimes

When inodes are read from disk, GFS2 will only update in-memory atimes
older than the on-disk atimes; this prevents atimes from going
backwards. The atimes of newly allocated inodes are initialized to 0.
This means that when an atime is explicitly set to a negative value,
this value will not persist.

Fix by setting the atime of newly allocated inodes to the lowest
possible value instead of 0.

Fixes xfstest generic/258.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 61b91cfd 01-Aug-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Fix trivial typos

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 9c1b2808 17-Jul-2017 Bob Peterson <rpeterso@redhat.com>

GFS2: Clear gl_object if gfs2_create_inode fails

If function gfs2_create_inode fails after the inode has been
created (for example, if the inode_refresh fails for some reason)
the function was setting gl_object but never clearing it again.
The glocks are left pointing to a freed inode. This patch adds
the calls to clear gl_object in the appropriate error paths.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>


# 4d7c18c7 17-Jul-2017 Bob Peterson <rpeterso@redhat.com>

GFS2: Set gl_object in inode lookup only after block type check

Before this patch, the inode glock's gl_object was set after a
reference was acquired, but before the block type was verified.
In cases where the block was unlinked, then freed and reused on
another node, a residule delete callback (delete_work) would try
to look up the inode, eventually failing the block check, but
only after it overwrites gl_object with a pointer to the wrong
inode. This patch moves the assignment of gl_object after the
block check so it won't be improperly overwritten.

Likewise, at the end of the function, gfs2_inode_lookup was
clearing gl_object after it unlocked the glock, which meant
another process might free the glock in the meantime. This
patch guards against that case.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>


# df3d87bd 18-Jul-2017 Bob Peterson <rpeterso@redhat.com>

GFS2: Introduce helper for clearing gl_object

This patch introduces a new helper function in glock.h that
clears gl_object, with an added integrity check. An additional
integrity check has been added to glock_set_object, plus comments.
This is step 1 in a series to ensure gl_object integrity.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>


# 98e5a91a 19-Jul-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Fixup to "Get rid of flush_delayed_work in gfs2_evict_inode"

When commit 4fd1a57952 moved the call to flush_delayed_work from
gfs2_evict_inode to gfs2_inode_lookup to avoid calling into DLM during
evict, a similar call should have been added to gfs2_create_inode:
that's another code path in which glocks of previous inodes may be
reused.

The flush of the iopen glock work queue added by 4fd1a57952, on the
other hand, is unnecessary and can be removed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# e0b62e21 30-Jun-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: gfs2_create_inode: Keep glock across iput

On failure, keep the inode glock across the final iput of the new inode
so that gfs2_evict_inode doesn't have to re-acquire the glock. That
way, gfs2_evict_inode won't need to revalidate the block type.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 6f6597ba 30-Jun-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Protect gl->gl_object by spin lock

Put all remaining accesses to gl->gl_object under the
gl->gl_lockref.lock spinlock to prevent races.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 4fd1a579 30-Jun-2017 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of flush_delayed_work in gfs2_evict_inode

So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
work queue to make sure that inode glops which dereference gl->gl_object
have finished running before the inode is destroyed. However, flushing
the work queue may do more work than needed, and in particular, it may
call into DLM, which we want to avoid here. Use a bit lock
(GIF_GLOP_PENDING) to synchronize between the inode glops and
gfs2_evict_inode instead to get rid of the flushing.

In addition, flush the work queues of existing glocks before reusing
them for new inodes to get those glocks into a known state: the glock
state engine currently doesn't handle glock re-appropriation correctly.
(We may be able to fix the glock state engine instead later.)

Based on a patch by Steven Whitehouse <swhiteho@redhat.com>.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# d4da3198 22-Feb-2017 Andreas Gruenbacher <agruenba@redhat.com>

Revert "GFS2: Wait for iopen glock dequeues"

Revert commit 86d067a797d4e8546a7c92b985f31e8cd3ec39ad: it turns out
that waiting for iopen glock dequeues here isn't needed anymore because
the bugs that commit was meant to fix have been fixed otherwise.

In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
shrinker context because the shrinker may be invoked on behalf of DLM,
in which case calling into DLM again would deadlock. This commit makes
the described scenario less likely without completely avoiding it; it's
still a step in the right direction, though.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# cc963a11 16-Mar-2017 Bob Peterson <rpeterso@redhat.com>

GFS2: Temporarily zero i_no_addr when creating a dinode

Before this patch i_no_addr was not initialized until after the
return from allocating its block. That meant the i_no_addr was
temporarily uninitialized storage. Ordinarily that's not a concern,
but if inplace_reserve can't find space, it can call try_rgrp_unlink
which references i_no_addr as a block to avoid. That can result in
unpredictable behavior. More importantly, the trace point in
gfs2_alloc_blocks references ip->i_no_addr before it is set, which
is misleading when reading the kernel traces. This patch makes it
look like the new dinode block was assigned in the name of inode 0
rather than a random inode that's completely unrelated.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};

struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};

The defined bits in request_mask and stx_mask are:

STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5b825c3a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dfeef688 09-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove ".readlink = generic_readlink" assignments

If .readlink == NULL implies generic_readlink().

Generated by:

to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 078cd827 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Replace CURRENT_TIME with current_time() for inode timestamps

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2773bf00 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

fs: rename "rename2" i_op to "rename"

Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 332f51d7 26-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Initialize atime of I_NEW inodes

Fix for commit 719ee344: initialize atime of I_NEW inodes to 0 so that
the timestamps read from disk will always be more recent than the
initial timestamp, and the atime in the I_NEW inode will be set correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 31051c85 26-May-2016 Jan Kara <jack@suse.cz>

fs: Give dentry to inode_change_ok() instead of inode

inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 337684a1 02-Aug-2016 Eryu Guan <guaneryu@gmail.com>

fs: return EPERM on immutable inode

In most cases, EPERM is returned on immutable inode, and there're only a
few places returning EACCES. I noticed this when running LTP on
overlayfs, setxattr03 failed due to unexpected EACCES on immutable
inode.

So converting all EACCES to EPERM on immutable inode.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 00699ad8 05-Jul-2016 Al Viro <viro@zeniv.linux.org.uk>

Use the right predicate in ->atomic_open() instances

->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache. Use d_in_lookup() to tell one from another, rather
than d_unhashed().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6df9f9a2 17-Jun-2016 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Lock holder cleanup

Make the code more readable by cleaning up the different ways of
initializing lock holders and checking for initialized lock holders:
mark lock holders as uninitialized by setting the holder's glock to NULL
(gfs2_holder_mark_uninitialized) instead of zeroing out the entire
object or using a separate flag. Recognize initialized holders by their
non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects
which are immeditiately initialized via gfs2_holder_init or
gfs2_glock_nq_init.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# cda9dd42 13-Jun-2016 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Large-filesystem fix for 32-bit systems

Commit ff34245d switched from iget5_locked to iget_locked among other
things, but iget_locked doesn't work for filesystems larger than 2^32
blocks on 32-bit systems. Switch back to iget5_locked. Filesystems
larger than 2^32 blocks are unrealistic to work well on 32-bit systems,
so this is mostly a code cleanliness fix.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# ec5ec66b 13-Jun-2016 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Get rid of gfs2_ilookup

Now that gfs2_lookup_by_inum only takes the inode glock for new inodes
(and not for cached inodes anymore), there no longer is a need to
optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
and gfs2_ilookup can be removed.

In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
inconsistency goes away as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 3ce37b2c 13-Jun-2016 Andreas Gruenbacher <agruenba@redhat.com>

gfs2: Fix gfs2_lookup_by_inum lock inversion

The current gfs2_lookup_by_inum takes the glock of a presumed inode
identified by block number, verifies that the block is indeed an inode,
and then instantiates and reads the new inode via gfs2_inode_lookup.

However, instantiating a new inode may block on freeing a previous
instance of that inode (__wait_on_freeing_inode), and freeing an inode
requires to take the glock already held, leading to lock inversion and
deadlock.

Fix this by first instantiating the new inode, then verifying that the
block is an inode (if required), and then reading in the new inode, all
in gfs2_inode_lookup.

If the block we are looking for is not an inode, we discard the new
inode via iget_failed, which marks inodes as bad and unhashes them.
Other tasks waiting on that inode will get back a bad inode back from
ilookup or iget_locked; in that case, retry the lookup.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 1a39ba99 12-May-2016 Al Viro <viro@zeniv.linux.org.uk>

gfs2: Switch to generic xattr handlers

Switch to the generic xattr handlers and take the necessary glocks at
the layer below. The following are the new xattr "entry points"; they
are called with the glock held already in the following cases:

gfs2_xattr_get: From SELinux, during lookups.
gfs2_xattr_set: The glock is never held.
gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and
gfs2_setattr -> posix_acl_chmod.
gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e97321fa 12-Apr-2016 Bob Peterson <rpeterso@redhat.com>

GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid

Function gfs2_inode_lookup was dereferencing the inode, and after,
it checks for the value being NULL. We need to check that first.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# ce23e640 10-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

->getxattr(): pass dentry and inode as separate arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 73b462d2 18-Dec-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Eliminate parameter non_block on gfs2_inode_lookup

Now that we're not filtering out I_FREEING inodes from our lookups
anymore, we can eliminate the non_block parameter from the lookup
function.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# ff34245d 27-Mar-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Don't filter out I_FREEING inodes anymore

This patch basically reverts a very old patch from 2008,
7a9f53b3c1875bef22ad4588e818bc046ef183da, with the title
"Alternate gfs2_iget to avoid looking up inodes being freed".
The original patch was designed to avoid a deadlock caused by lock
ordering with try_rgrp_unlink. The patch forced the function to not
find inodes that were being removed by VFS. The problem is, that
made it impossible for nodes to delete their own unlinked dinodes
after a certain point in time, because the inode needed was not found
by this filtering process. There is no longer a need for the patch,
since function try_rgrp_unlink no longer locks the inode: All it does
is queue the glock onto the delete work_queue, so there should be no
more deadlock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a4923865 07-Dec-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Prevent delete work from occurring on glocks used for create

This patch tries to prevent delete work (queued via iopen callback)
from executing if the glock is currently being used to create
a new inode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fceef393 29-Dec-2015 Al Viro <viro@zeniv.linux.org.uk>

switch ->get_link() to delayed_call, kill ->put_link()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6cc4b6e8 04-Dec-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Don't do glock put on when inode creation fails

Currently the error path of function gfs2_inode_lookup calls function
gfs2_glock_put corresponding to an earlier call to gfs2_glock_get for
the inode glock. That's wrong because the error path also calls
iget_failed() which eventually calls iput, which eventually calls
gfs2_evict_inode, which does another gfs2_glock_put. This double-put
can cause the glock reference count to get off.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 783013c0 04-Dec-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Release iopen glock in gfs2_create_inode error cases

Some error cases in gfs2_create_inode were not unlocking the iopen
glock, getting the reference count off. This adds the proper unlock.
The error logic in function gfs2_create_inode was also convoluted,
so this patch simplifies it. It also takes care of a bug in
which gfs2_qa_delete() was not called in an error case.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 86d067a7 07-Dec-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Wait for iopen glock dequeues

This patch changes every glock_dq for iopen glocks into a dq_wait.
This makes sure that iopen glocks do not outlive the inode itself.
In turn, that ensures that anyone trying to unlink the glock will
be able to find the inode when it receives a remote iopen callback.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# a097dc7e 16-Jul-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Make rgrp reservations part of the gfs2_inode structure

Before this patch, multi-block reservation structures were allocated
from a special slab. This patch folds the structure into the gfs2_inode
structure. The disadvantage is that the gfs2_inode needs more memory,
even when a file is opened read-only. The advantages are: (a) we don't
need the special slab and the extra time it takes to allocate and
deallocate from it. (b) we no longer need to worry that the structure
exists for things like quota management. (c) This also allows us to
remove the calls to get_write_access and put_write_access since we
know the structure will exist.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 6b255391 17-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

replace ->follow_link() with new method that could stay in RCU mode

new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b54e9a0b 26-Oct-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Extract quota data from reservations structure (revert 5407e24)

This patch basically reverts the majority of patch 5407e24.
That patch eliminated the gfs2_qadata structure in favor of just
using the reservations structure. The problem with doing that is that
it increases the size of the reservations structure. That is not an
issue until it comes time to fold the reservations structure into the
inode in memory so we know it's always there. By separating out the
quota structure again, we aren't punishing the non-quota users by
making all the inodes bigger, requiring more slab space. This patch
creates a new slab area to allocate the quota stuff so it's managed
a little more sanely.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# acc546fd 10-Nov-2015 Abhi Das <adas@redhat.com>

gfs2: Automatically set GFS2_DIF_SYSTEM flag on system files

When new files and directories are created inside a parent directory
we automatically inherit the GFS2_DIF_SYSTEM flag (if set) and assign
it to the new file/dirs.

All new system files/dirs created in the metafs by, say gfs2_jadd,
will have this flag set because they will have parent directories in
the metafs whose GFS2_DIF_SYSTEM flag has already been set (most likely
by a previous mkfs.gfs2)

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 39a72580 02-Jun-2015 Abhi Das <adas@redhat.com>

gfs2: fix quota updates on block boundaries

For smaller block sizes (512B, 1K, 2K), some quotas straddle block
boundaries such that the usage value is on one block and the rest
of the quota is on the previous block. In such cases, the value
does not get updated correctly. This patch fixes that by addressing
the boundary conditions correctly.

This patch also adds a (s64) cast that was missing in a call to
gfs2_quota_change() in inode.c

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 6e77137b 02-May-2015 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata to ->follow_link()

its only use is getting passed to nd_jump_link(), which can obtain
it from current->nameidata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 680baacb 02-May-2015 Al Viro <viro@zeniv.linux.org.uk>

new ->follow_link() and ->put_link() calling conventions

a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.

Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().

b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a63b7bbc 04-May-2015 Benjamin Marzinski <bmarzins@redhat.com>

GFS2: add support for rename2 and RENAME_EXCHANGE

gfs2 now uses the rename2 directory iop, and supports the
RENAME_EXCHANGE flag (as well as RENAME_NOREPLACE, which the vfs
takes care of).

Signed-off-by: Benjamin Marzinski <bmarzins redhat com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 86fbca49 30-Apr-2015 Antonio Ospite <ao2@ao2.it>

GFS2: inode.c: indent with TABs, not spaces

Follow the same style used for the other functions in the same file.

Signed-off-by: Antonio Ospite <ao2@ao2.it>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b8fbf471 17-Mar-2015 Abhi Das <adas@redhat.com>

gfs2: perform quota checks against allocation parameters

Use struct gfs2_alloc_parms as an argument to gfs2_quota_check()
and gfs2_quota_lock_check() to check for quota violations while
accounting for the new blocks requested by the current operation
in ap->target.

Previously, the number of new blocks requested during an operation
were not accounted for during quota_check and would allow these
operations to exceed quota. This was not very apparent since most
operations allocated only 1 block at a time and quotas would get
violated in the next operation. i.e. quota excess would only be by
1 block or so. With fallocate, (where we allocate a bunch of blocks
at once) the quota excess is non-trivial and is addressed by this
patch.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>


# 45094a58 22-Jan-2015 Bob Peterson <rpeterso@redhat.com>

GFS2: Eliminate a nonsense goto


This patch just removes a goto that did nothing.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ec7d879c 19-Nov-2014 Al Viro <viro@ZenIV.linux.org.uk>

GFS2: gfs2_atomic_open(): simplify the use of finish_no_open()

In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);

No need to open-code that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 44bb31ba 19-Nov-2014 Al Viro <viro@ZenIV.linux.org.uk>

GFS2: gfs2_create_inode(): don't bother with d_splice_alias()

dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory. In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 571a4b57 19-Nov-2014 Al Viro <viro@ZenIV.linux.org.uk>

GFS2: bugger off early if O_CREAT open finds a directory

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 845409b4 18-Nov-2014 Al Viro <viro@zeniv.linux.org.uk>

gfs2_atomic_open(): simplify the use of finish_no_open()

In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);

No need to open-code that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 81295ce6 18-Sep-2014 Al Viro <viro@zeniv.linux.org.uk>

gfs2_create_inode(): don't bother with d_splice_alias()

dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory. In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 986cdb86 18-Sep-2014 Al Viro <viro@zeniv.linux.org.uk>

gfs2: bugger off early if O_CREAT open finds a directory

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2e60d768 13-Nov-2014 Benjamin Marzinski <bmarzins@redhat.com>

GFS2: update freeze code to use freeze/thaw_super on all nodes

The current gfs2 freezing code is considerably more complicated than it
should be because it doesn't use the vfs freezing code on any node except
the one that begins the freeze. This is because it needs to acquire a
cluster glock before calling the vfs code to prevent a deadlock, and
without the new freeze_super and thaw_super hooks, that was impossible. To
deal with the issue, gfs2 had to do some hacky locking tricks to make sure
that a frozen node couldn't be holding on a lock it needed to do the
unfreeze ioctl.

This patch makes use of the new hooks to simply the gfs2 locking code. Now,
all the nodes in the cluster freeze and thaw in exactly the same way. Every
node in the cluster caches the freeze glock in the shared state. The new
freeze_super hook allows the freezing node to grab this freeze glock in
the exclusive state without first calling the vfs freeze_super function.
All the nodes in the cluster see this lock change, and call the vfs
freeze_super function. The vfs locking code guarantees that the nodes can't
get stuck holding the glocks necessary to unfreeze the system. To
unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
glock. Again, all the nodes notice this, reacquire the glock in shared mode
and call the vfs thaw_super function.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 37975f15 09-Oct-2014 Fabian Frederick <fabf@skynet.be>

GFS2: directly return gfs2_dir_check()

No need to store gfs2_dir_check result and test it before returning.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4d93bc3e 12-Sep-2014 Al Viro <viro@zeniv.linux.org.uk>

gfs2_atomic_open(): skip lookups on hashed dentry

hashed dentry can be passed to ->atomic_open() only if
a) it has just passed revalidation and
b) it's negative

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 19aeb5a6 29-Sep-2014 Bob Peterson <rpeterso@redhat.com>

GFS2: Make rename not save dirent location

This patch fixes a regression in the patch "GFS2: Remember directory
insert point", commit 2b47dad866d04f14c328f888ba5406057b8c7d33.
The problem had to do with the rename function: The function found
space for the new dirent, and remembered that location. But then the
old dirent was removed, which often moved the eligible location for
the renamed dirent. Putting the new dirent at the saved location
caused file system corruption.

This patch adds a new "save_loc" variable to struct gfs2_diradd.
If 1, the dirent location is saved. If 0, the dirent location is not
saved and the buffer_head is released as per previous behavior.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 00a158be 18-Sep-2014 Abhi Das <adas@redhat.com>

GFS2: fix bad inode i_goal values during block allocation

This patch checks if i_goal is either zero or if doesn't exist
within any rgrp (i.e gfs2_blk2rgrpd() returns NULL). If so, it
assigns the ip->i_no_addr block as the i_goal.

There are two scenarios where a bad i_goal can result in a
-EBADSLT error.

1. Attempting to allocate to an existing inode:
Control reaches gfs2_inplace_reserve() and ip->i_goal is bad.
We need to fix i_goal here.

2. A new inode is created in a directory whose i_goal is hosed:
In this case, the parent dir's i_goal is copied onto the new
inode. Since the new inode is not yet created, the ip->i_no_addr
field is invalid and so, the fix in gfs2_inplace_reserve() as per
1) won't work in this scenario. We need to catch and fix it sooner
in the parent dir itself (gfs2_create_inode()), before it is
copied to the new inode.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# cfb2f9d5 12-Sep-2014 Al Viro <viro@ZenIV.linux.org.uk>

GFS2: fix d_splice_alias() misuses

Callers of d_splice_alias(dentry, inode) don't need iput(), neither
on success nor on failure. Either the reference to inode is stored
in a previously negative dentry, or it's dropped. In either case
inode reference the caller used to hold is consumed.

__gfs2_lookup() does iput() in case when d_splice_alias() has failed.
Double iput() if we ever hit that. And gfs2_create_inode() ends up
not only with double iput(), but with link count dropped to zero - on
an inode it has just found in directory.

Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 7b7a9115 10-Sep-2014 Benjamin Coddington <bcodding@redhat.com>

GFS2: Hash the negative dentry during inode lookup

Fix a regression introduced by:
6d4ade986f9c8df31e68 GFS2: Add atomic_open support
where an early return misses d_splice_alias() which had been
adding the negative dentry.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 24972557 01-May-2014 Benjamin Marzinski <bmarzins@redhat.com>

GFS2: remove transaction glock

GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.

This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In it's place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.

When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
the and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesytem simply involes dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.

However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.

In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.

The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 05978803 31-Mar-2014 Abhi Das <adas@redhat.com>

GFS2: Fix uninitialized VFS inode in gfs2_create_inode

When gfs2_create_inode() fails due to quota violation, the VFS
inode is not completely uninitialized. This can cause a list
corruption error.

This patch correctly uninitializes the VFS inode when a quota
violation occurs in the gfs2_create_inode codepath.

Resolves: rhbz#1059808
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f45dc26d 19-Mar-2014 Bob Peterson <rpeterso@redhat.com>

GFS2: Remove extraneous function gfs2_security_init

This patch eliminates function gfs2_security_init in favor of just
calling security_inode_init_security directly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 48f8f711 12-Mar-2014 Abhi Das <adas@redhat.com>

GFS2: check NULL return value in gfs2_ok_to_move

gfs2_lookupi() can return NULL if the path to the root is broken by
another rename/rmdir. In this case gfs2_ok_to_move() must check for
this NULL pointer and return error.

Resolves: rhbz#1060246
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b2c8b3ea 04-Feb-2014 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Allocate block for xattr at inode alloc time, if required

This is another step towards improving the allocation of xattr
blocks at inode allocation time. Here we take advantage of
Christoph's recent work on ACLs to allocate a block for the
xattrs early if we know that we will be adding ACLs to the
inode later on. The advantage of that is that it is much
more likely that we'll get a contiguous run of two blocks
where the first is the inode and the second is the xattr block.

We still have to fall back to the original system in case we
don't get the requested two contiguous blocks, or in case the
ACLs are too large to fit into the block.

Future patches will move more of the ACL setting code further
up the gfs2_inode_create() function. Also, I'd like to be
able to do the same thing with the xattrs from LSMs in
due course, too. That way we should be able to slowly reduce
the number of independent transactions, at least in the
most common cases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e01580bf 20-Dec-2013 Christoph Hellwig <hch@infradead.org>

gfs2: use generic posix ACL infrastructure

This contains some major refactoring for the create path so that
inodes are created with the right mode to start with instead of
fixing it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d57b9c9a 16-Jan-2014 J. Bruce Fields <bfields@fieldses.org>

GFS2: revert "GFS2: d_splice_alias() can't return error"

0d0d110720d7960b77c03c9f2597faaff4b484ae asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".

That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ac3beb6a 16-Jan-2014 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Don't use ENOBUFS when ENOMEM is the correct error code

Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.

> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>


# 62e96cf8 06-Jan-2014 Bob Peterson <rpeterso@redhat.com>

GFS2: Increase i_writecount during gfs2_setattr_chown

This patch calls get_write_access in function gfs2_setattr_chown,
which merely increases inode->i_writecount for the duration of the
function. That will ensure that any file closes won't delete the
inode's multi-block reservation while the function is running.
It also ensures that a multi-block reservation exists when needed
for quota change operations during the chown.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2b47dad8 05-Jan-2014 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remember directory insert point

When we look to see if there is enough space to add a dir
entry without allocation, we have then been repeating the
same search later when we do the actual insertion. This
patch caches the details of the location in the gfs2_diradd
structure, so that we do not have to repeat the search.

This will provide a performance improvement which will be
greater as the size of the directory increases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 534cf9ca 05-Jan-2014 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Consolidate transaction blocks calculation for dir add

There are three cases where we need to calculate the number of
blocks to reserve in a transaction involving linking an inode
into a directory. The one in rename is a bit more complicated,
but the basis of it is the same as for link and create. So it
makes sense to move this calculation into a single function
rather than repeating it three times.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3c1c0ae1 06-Jan-2014 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Add directory addition info structure

The intent is that this structure will hold the information
required when adding entries to a directory (linking). To
start with, it will contain only the number of blocks which
are required to link the new entry into the directory. The
current calculation returns either 0 or the maximim number of
blocks that can ever be requested by such a transaction.

The intent is that in a later patch, we can update the dir
code to calculate this value more accurately. In addition
further patches will also add further fields to the new
structure to increase its utility.

In addition this patch fixes a bug where the link used during
inode creation was adding requesting too many blocks in
some cases. This is harmless unless the fs is close to being
full.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ea0341e0 21-Nov-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix ref count bug relating to atomic_open

In the case that atomic_open calls finish_no_open() with
the dentry that was supplied to gfs2_atomic_open() an
extra reference count is required. This patch fixes that
issue preventing a bug trap triggering at umount time.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 87dc800b 16-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: kfree_put_link()

duplicated to hell and back...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7b9cff46 02-Oct-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Add allocation parameters structure

This patch adds a structure to contain allocation parameters with
the intention of future expansion of this structure. The idea is
that we should be able to add more information about the allocation
in the future in order to allow the allocator to make a better job
of placing the requests on-disk.

There is no functional difference from applying this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# af5c2697 26-Sep-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up reservation removal

The reservation for an inode should be cleared when it is truncated so
that we can start again at a different offset for future allocations.
We could try and do better than that, by resetting the search based on
where the truncation started from, but this is only a first step.

In addition, there are three callers of gfs2_rs_delete() but only one
of those should really be testing the value of i_writecount. While
we get away with that in the other cases currently, I think it would
be better if we made that test specific to the one case which
requires it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 5ca1db41 23-Sep-2013 Miklos Szeredi <miklos@szeredi.hu>

GFS2: fix dentry leaks

We need to dput() the result of d_splice_alias(), unless it is passed to
finish_no_open().

Edited by Steven Whitehouse in order to make it apply to the current
GFS2 git tree, and taking account of a prerequisite patch which hasn't
been applied.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: stable@vger.kernel.org


# 0d0d1107 16-Sep-2013 Miklos Szeredi <mszeredi@suse.cz>

GFS2: d_splice_alias() can't return error

unless it was given an IS_ERR(inode), which isn't the case here. So clean
up the unnecessary error handling in gfs2_create_inode().

This paves the way for real fixes (hence the stable Cc).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: stable@vger.kernel.org


# c5bf8fef 16-Sep-2013 Miklos Szeredi <mszeredi@suse.cz>

gfs2: set FILE_CREATED

In gfs2_create_inode() set FILE_CREATED in *opened.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bd9ee58 16-Aug-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Check for glock already held in gfs2_getxattr

Since the introduction of atomic_open, gfs2_getxattr can be
called with the glock already held, so we need to allow for
this.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>


# 2523d47a 17-Jul-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix typo in gfs2_create_inode()

PTR_RET should be PTR_ERR

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ad151d54 17-Jul-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO

PTR_RET should be PTR_ERR

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


# 6d4ade98 14-Jun-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Add atomic_open support

I've restricted atomic_open to only operate on regular files, although
I still don't understand why atomic_open should not be possible also for
directories on GFS2. That can always be added in later though, if it
makes sense.

The ->atomic_open function can be passed negative dentries, which
in most cases means either ENOENT (->lookup) or a call to d_instantiate
(->create). In the GFS2 case though, we need to actually perform the
look up, since we do not know whether there has been a new inode created
on another node. The look up calls d_splice_alias which then tries to
rehash the dentry - so the solution here is to simply check for that
in d_splice_alias. The same issue is likely to affect any other cluster
filesystem implementing ->atomic_open

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields fieldses org>
Cc: Jeff Layton <jlayton@redhat.com>


# 5a00f3cc 11-Jun-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Only do one directory search on create

Creation of a new inode requires a directory search in order to ensure
that we are not trying to create an inode with the same name as an
existing one. This was hidden away inside the create_ok() function.

In the case that there was an existing inode, and a lookup can be
substituted for a create (which is the case with regular files
when the O_EXCL flag is not in use) then we were doing a second
lookup in order to return the inode.

This patch merges these two lookups into one. This can be done by
passing a flag to gfs2_dir_search() to tell it to just return -EEXIST
in the cases where we don't actually want to look up the inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2b6f8860 01-Jun-2013 Thomas Meyer <thomas@m3y3r.de>

GFS2: Cocci spatch "ptr_ret.spatch"

Use PTR_RET in place of open coding this function.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a6a4d98b 29-May-2013 Bob Peterson <rpeterso@redhat.com>

GFS2: Don't cache iopen glocks

This patch makes GFS2 immediately reclaim/delete all iopen glocks
as soon as they're dequeued. This allows deleters to get an
EXclusive lock on iopen so files are deleted properly instead of
being set as unlinked.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 79ba7480 01-Mar-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Use gfs2_dinode_out() in the inode create path

Over the previous two patches relating to inode creation, the
content of init_dinode() has been looking more and more like
gfs2_dinode_out(). This is not an accident! This patch replaces
the parts of init_dinode() which are duplicated in gfs2_dinode_out()
with a call to that function.

Mostly that is straightforward, but there is one issue which needed
to be resolved relating to the link count. The link count has to be
set to zero in a certain error handling code path, which lands up
calling iput(). This is now done specifically in that code path
allowing the link count to be set earlier and written into the
on disk inode by gfs2_dinode_put() in the normal way.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 28fb3027 26-Feb-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove gfs2_refresh_inode from inode creation path

The original method for creating inodes used in GFS2 was to fill
out a buffer, with all the information, and then to read that
buffer into the in-core inode, using gfs2_refresh_inode()

The problem with this approach is that all the inode's fields
need to be calculated ahead of time, and were stored in various
variables making the code rather complicated.

The new approach is simply to allocate the in-core inode earlier
and fill in as many fields as possible ahead of time. These can
then be used to initilise the on disk representation. The
code has been working towards the point where it is possible
to remove gfs2_refresh_inode() because all the fields are
correctly initialised ahead of time. We've now reached that
milestone, and have reversed the order of setting up the in
core and on disk inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# fd4b4e04 26-Feb-2013 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up inode creation path

This patch cleans up the inode creation code path in GFS2. After the
Orlov allocator was merged, a number of potential improvements are
now possible, and this is a first set of these.

The quota handling is now updated so that it matches the point in
the code where the allocation takes place. This means that the one
exception in gfs2_alloc_blocks relating to quota is now no longer
required, and we can use the generic code everywhere.

In addition the call to figure out whether we need to allocate any
extra blocks in order to add a directory entry is moved higher up
gfs2_create_inode. This means that if it returns an error, we
can deal with that at a stage where it is easier to handle that case.
The returned status cannot change during the function since we hold
an exclusive lock on the directory.

Two calls to gfs2_rindex_update have been changed to one, again at
the top of gfs2_create_inode to simplify error handling.

The time stamps are also now initialised earlier in the creation
process, this is gradually moving towards being able to remove the
call to gfs2_refresh_inode in gfs2_inode_create once we have all the
fields covered.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d0546426 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

gfs2: Convert uids and gids between dinodes and vfs inodes.

When reading dinodes from the disk convert uids and gids
into kuids and kgids to store in vfs data structures.

When writing to dinodes to the disk convert kuids and kgids
in the in memory structures into plain uids and gids.

For now all on disk data structures are assumed to be
stored in the initial user namespace.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 6b24c0d2 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

gfs2: Use uid_eq and gid_eq where appropriate

Where kuid_t values are compared use uid_eq and where kgid_t values
are compared use gid_eq. This is unfortunately necessary because
of the type safety that keeps someone from accidentally mixing
kuids and kgids with other types.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 7c06b5d6 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

gfs2: Use kuid_t and kgid_t types where appropriate.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# f4108a60 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

gfs2: Split NO_QUOTA_CHANGE inot NO_UID_QUTOA_CHANGE and NO_GID_QUTOA_CHANGE

Split NO_QUOTA_CHANGE into NO_UID_QUTOA_CHANGE and NO_GID_QUTOA_CHANGE
so the constants may be well typed.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 350a9b0a 13-Dec-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Split gfs2_trans_add_bh() into two

There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.

Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 1e2d9d44 21-Nov-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Set gl_object during inode create

This patch fixes a cluster coherency problem that occurs when one
node creates a file, does several writes, then a different node
tries to write to the same file. When the inode's glock is demoted,
the inode wasn't synced to the media properly because the gl_object
wasn't set. Later, the flush daemon noticed the uncommitted data
and tried to flush it, only to discover the glock was no longer locked
properly in exclusive mode. That caused an assert withdraw.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# be4f245d 16-Nov-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: add error check while allocating new inodes

This patch adds a return code check after attempting to allocate
a new inode during dinode creation.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4327a9bf 12-Nov-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode

Since we now have a dirty_inode that takes care of manipulating the
inode buffer and writing from the inode to the buffer, we can
eliminate some unnecessary buffer manipulations in gfs2_unlink_inode
that are now redundant.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9dbe9610 31-Oct-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Add Orlov allocator

Just like ext3, this works on the root directory and any directory
with the +T flag set. Also, just like ext3, any subdirectory created
in one of the just mentioned cases will be allocated to a random
resource group (GFS2 equivalent of a block group).

If you are creating a set of directories, each of which will contain a
job running on a different node, then by setting +T on the parent
directory before creating the subdirectories, each will land up in a
different resource group, and thus resource group contention between
nodes will be kept to a minimum.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c9aecf73 31-Oct-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Use proper allocation context for new inodes

Rather than using the parent directory's allocation context, this
patch allocated the new inode earlier in the process and then uses
it to contain all the information required. As a result, we can now
use the new inode's own allocation context to allocate it rather
than having to use the parent directory's context. This give us a
lot more flexibility in where the inode is placed on disk.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ff7f4cb4 10-Sep-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Consolidate free block searching functions

With the recently added block reservation code, an additional function
was added to search for free blocks. This had a restriction of only being
able to search for aligned extents of free blocks. As a result the
allocation patterns when reserving blocks were suboptimal when the
existing allocation of blocks for an inode was not aligned to the same
boundary.

This patch resolves that problem by adding the ability for gfs2_rbm_find
to search for extents of a particular minimum size. We can then use
gfs2_rbm_find for both looking for reservations, and also looking for
free blocks on an individual basis when we actually come to do the
allocation later on. As a result we only need a single set of code
to deal with both situations.

The function gfs2_rbm_from_block() is moved up rgrp.c so that it
occurs before all of its callers.

Many thanks are due to Bob for helping track down the final issue in
this patch. That fix to the rb_tree traversal and to not share
block reservations from a dirctory to its children is included here.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 71f890f7 30-Jul-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove rs_requested field from reservations

The rs_requested field is left over from the original allocation
code, however this should have been a parameter passed to the
various functions from gfs2_inplace_reserve() and not a member of the
reservation structure as the value is not required after the
initial allocation.

This also helps simplify the code since we no longer need to set
the rs_requested to zero. Also the gfs2_inplace_release()
function can also be simplified since the reservation structure
will always be defined when it is called, and the only remaining
task is to unlock the rgrp if required. It can also now be
called unconditionally too, resulting in a further simplification.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 645b2ccc 26-Jul-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix missing allocation data for set/remove xattr

These entry points were missed in the original patch to allocate
this data structure.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 8e2e0047 19-Jul-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Reduce file fragmentation

This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
resulting improved on disk layout greatly speeds up operations in cases
which would have resulted in interlaced allocation of blocks previously.
A typical example of this is 10 parallel dd processes, each writing to a
file in a common dirctory.

The implementation uses an rbtree of reservations attached to each
resource group (and each inode).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ebfc3b49 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

don't pass nameidata to ->create()

boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 00cd8dd3 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

stop passing nameidata to ->lookup()

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5407e242 18-May-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Fold quota data into the reservations struct

This patch moves the ancillary quota data structures into the
block reservations structure. This saves GFS2 some time and
effort in allocating and deallocating the qadata structure.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 0a305e49 06-Jun-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Extend the life of the reservations

This patch lengthens the lifespan of the reservations structure for
inodes. Before, they were allocated and deallocated for every write
operation. With this patch, they are allocated when the first write
occurs, and deallocated when the last process closes the file.
It's more efficient to do it this way because it saves GFS2 a lot of
unnecessary allocates and frees. It also gives us more flexibility
for the future: (1) we can now fold the qadata structure back into
the structure and save those alloc/frees, (2) we can use this for
multi-block reservations.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 5e2f7d61 04-Apr-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Make sure rindex is uptodate before starting transactions

This patch removes the call from gfs2_blk2rgrd to function
gfs2_rindex_update and replaces it with individual calls.
The former way turned out to be too problematic.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 66fc061b 07-Feb-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: FITRIM ioctl support

The FITRIM ioctl provides an alternative way to send discard requests to
the underlying device. Using the discard mount option results in every
freed block generating a discard request to the block device. This can
be slow, since many block devices can only process discard requests of
larger sizes, and also such operations can be time consuming.

Rather than using the discard mount option, FITRIM allows a sweep of the
filesystem on an occasional basis, and also to optionally avoid sending
down discard requests for smaller regions.

In GFS2 FITRIM will work at resource group granularity. There is a flag
for each resource group which keeps track of which resource groups have
been trimmed. This flag is reset whenever a deallocation occurs in the
resource group, and set whenever a successful FITRIM of that resource
group has taken place. This helps to reduce repeated discard requests
for the same block ranges, again improving performance.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a365fbf3 24-Feb-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Read resource groups on mount

This makes mount take slightly longer, but at the same time, the first
write to the filesystem will be faster too. It also means that if there
is a problem in the resource index, then we can refuse to mount rather
than having to try and report that when the first write occurs.

In addition, to avoid recursive locking, we hvae to take account of
instances when the rindex glock may already be held when we are
trying to update the rbtree of resource groups.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 718b97bd 16-Feb-2012 Bob Peterson <rpeterso@redhat.com>

GFS2: Read in rindex if necessary during unlink

This patch fixes a problem whereby you were unable to delete
files until other file system operations were done (such as
statfs, touch, writes, etc.) that caused the rindex to be
read in.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 66ad863b 10-Jan-2012 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix nlink setting on inode creation

Since the nlink count will be 0, we need to use set_nlink rather
than inc_nlink in order to avoid triggering the inc_nlink warning
which was added recently.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 175a4eb7 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

fs: propagate umode_t, misc bits

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1a67aafb 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->mknod() to umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4acdaf27 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch ->create() to umode_t

vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18bb1db3 25-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

switch vfs_mkdir() and ->mkdir() to umode_t

vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 46cc1e5f 23-Sep-2011 H Hartley Sweeten <hartleys@visionengravers.com>

GFS2: local functions should be static

Quiets the sparse noise:

warning: symbol 'gfs2_initxattrs' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6a8099ed 21-Nov-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix multi-block allocation

Clean up gfs2_alloc_blocks so that it takes the full extent length
rather than just the number of non-inode blocks as an argument. That
will only make a difference in the inode allocation case for now.

Also, this fixes the extent length handling around gfs2_alloc_extent() so
that multi block allocations will work again.

The rd_last_alloc block is set to the final block in the allocated
extent (as per the update to i_goal, but referenced to a different
start point).

This also removes the dinode argument to rgblk_search() which is no
longer used.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 564e12b1 21-Nov-2011 Bob Peterson <rpeterso@redhat.com>

GFS2: decouple quota allocations from block allocations

This patch separates the code pertaining to allocations into two
parts: quota-related information and block reservations.
This patch also moves all the block reservation structure allocations to
function gfs2_inplace_reserve to simplify the code, and moves
the frees to function gfs2_inplace_release.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6e87ed0f 18-Nov-2011 Bob Peterson <rpeterso@redhat.com>

GFS2: move toward a generic multi-block allocator

This patch is a revision of the one I previously posted.
I tried to integrate all the suggestions Steve gave.
The purpose of the patch is to change function gfs2_alloc_block
(allocate either a dinode block or an extent of data blocks)
to a more generic gfs2_alloc_blocks function that can
allocate both a dinode _and_ an extent of data blocks in the
same call. This will ultimately help us create a multi-block
reservation scheme to reduce file fragmentation.

This patch moves more toward a generic multi-block allocator that
takes a pointer to the number of data blocks to allocate, plus whether
or not to allocate a dinode. In theory, it could be called to allocate
(1) a single dinode block, (2) a group of one or more data blocks, or
(3) a dinode plus several data blocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3c5d785a 14-Nov-2011 Bob Peterson <rpeterso@redhat.com>

GFS2: combine gfs2_alloc_block and gfs2_alloc_di

GFS2 functions gfs2_alloc_block and gfs2_alloc_di do basically
the same things, with a few exceptions. This patch combines
the two functions into a slightly more generic gfs2_alloc_block.
Having one centralized block allocation function will reduce
code redundancy and make it easier to implement multi-block
reservations to reduce file fragmentation in the future.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 87654896 08-Nov-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: More automated code analysis fixes

A potentially uninitialised variable, some unreachable code,
and the main part of this, fixing the error path in the
unlink function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 54335b1f 01-Sep-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Cache the most recently used resource group in the inode

This means that after the initial allocation for any inode, the
last used resource group is cached in the inode for future use.
This drastically reduces the number of lookups of resource
groups in the common case, and this the contention on that
data structure.

The allocation algorithm is the same as previously, except that we
always check to see if the goal block is within the cached rgrp
first before going to the rbtree to look one up.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 8339ee54 31-Aug-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Make resource groups "append only" during life of fs

Since we have ruled out supporting online filesystem shrink,
it is possible to make the resource group list append only
during the life of a super block. This gives several benefits:

Firstly, we only need to read new rindex elements as they are added
rather than needing to reread the whole rindex file each time one
element is added.

Secondly, the rindex glock can be held for much shorter periods of
time, and is completely removed from the fast path for allocations.
The lock is taken in shared mode only when updating the resource
groups when the first allocation occurs, and after a grow has
taken place.

Thirdly, this results in a reduction in code size, and everything
gets a lot simpler to understand in this area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9a63edd1 18-Aug-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up gfs2_create

If we pass through knowledge of whether the creation is intended to be
exclusive or not, then we can deal with that in gfs2_create_inode
and remove one set of locking. Also this removes the loop in
gfs2_create and simplifies the code a bit.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ab9bbda0 15-Aug-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Use ->dirty_inode()

The aim of this patch is to use the newly enhanced ->dirty_inode()
super block operation to deal with atime updates, rather than
piggy backing that code into ->write_inode() as is currently
done.

The net result is a simplification of the code in various places
and a reduction of the number of gfs2_dinode_out() calls since
this is now implied by ->dirty_inode().

Some of the mark_inode_dirty() calls have been moved under glocks
in order to take advantage of then being able to avoid locking in
->dirty_inode() when we already have suitable locks.

One consequence is that generic_write_end() now correctly deals
with file size updates, so that we do not need a separate check
for that afterwards. This also, indirectly, means that fdatasync
should work correctly on GFS2 - the current code always syncs the
metadata whether it needs to or not.

Has survived testing with postmark (with and without atime) and
also fsx.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 40ac218f 02-Aug-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix inode allocation error path

If we have got far enough through the inode allocation code
path that an inode has already been allocated, then we must
call iput to dispose of it, if an error occurs during a
later part of the process. This will always be the final iput
since there will be no other references to the inode.

Unlike when the inode has been unlinked, its block state will
be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
to skip the test in ->evict_inode() for this one case in order
to ensure that it will be deallocated correctly. This patch adds
a new flag in order to ensure that this will happen correctly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4e34e719 23-Jul-2011 Christoph Hellwig <hch@lst.de>

fs: take the ACL checks to common code

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6c673ab3 17-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

simplify gfs2_lookup()

d_splice_alias() will DTRT when given NULL or ERR_PTR

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 10556cb2 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->permission()

not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2830ba7f 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to generic_permission()

redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 178ea735 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

kill check_acl callback of generic_permission()

its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9d8f13ba 06-Jun-2011 Mimi Zohar <zohar@linux.vnet.ibm.com>

security: new security_inode_init_security API adds function callback

This patch changes the security_inode_init_security API by adding a
filesystem specific callback to write security extended attributes.
This change is in preparation for supporting the initialization of
multiple LSM xattrs and the EVM xattr. Initially the callback function
walks an array of xattrs, writing each xattr separately, but could be
optimized to write multiple xattrs at once.

For existing security_inode_init_security() calls, which have not yet
been converted to use the new callback function, such as those in
reiserfs and ocfs2, this patch defines security_old_inode_init_security().

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>


# f2741d98 12-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move all locking inside the inode creation function

Now that there are no longer any exceptions to the normal inode
creation code path, we can move the parts of the locking code
which were duplicated in mkdir/mknod/create/symlink into the
inode create function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 160b4026 13-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up symlink creation

This moves the symlink specific parts of inode creation
into the function where we initialise the rest of the
dinode. As a result we have one less place where we need
to look up the inode's buffer.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e2d0a13b 13-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up mkdir

This moves the initialisation of the directory into the inode
creation functions to avoid having to duplicate the lookup
of the inode's buffer.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2ab9cd1c 10-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Rename ops_inode.c to inode.c

This is the final part of the ops_inode.c/inode.c reordering. We
are left with a single file called inode.c which now contains
all the inode operations, as expected.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 64ea5402 10-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Inode.c is empty now, remove it

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9eed04cd 09-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move final part of inode.c into super.c

Now inode.c is empty.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 194c011f 09-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move most of the remaining inode.c into ops_inode.c

This is in preparation to remove inode.c and rename ops_inode.c
to inode.c. Also most of the functions which were left in inode.c
relate to the creation and lookup of inodes. I'm intending to work
on consolidating some of that code, and its easier when its all in
one place.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d4b2cf1b 09-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move gfs2_refresh_inode() and friends into glops.c

Eventually there will only be a single caller of this code, so lets
move it where it can be made static at some future date.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 94fb763b 09-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove gfs2_dinode_print() function

This function was intended for debugging purposes, but it is not very
useful. If we want to know what is on disk then all we need is a
block number and gfs2_edit can give us much better information about
what is there. Otherwise, if we are interested in what is stored in
the in-core inode, it doesn't help us out there either.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3d6ecb7d 09-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: When adding a new dir entry, inc link count if it is a subdir

This adds an increment of the link count when we add a new directory
entry, if that entry is itself a directory. This means that we no
longer need separate code to perform this operation.

Now that both adding and removing directory entries automatically
update the parent directory's link count if required, that makes
the code shorter and simpler than before.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d192a8e5 04-May-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Double check link count under glock

To avoid any possible races relating to the link count, we need to
recheck it under the inode's glock in all cases where it matters.
Also to ensure we never get any nasty surprises, this patch also
ensures that once the link count has hit zero it can never be
elevated by rereading in data from disk.

The only place we cannot provide a proper solution is in rename
in the case where we are removing a target inode and we discover
that the target inode has been already unlinked on another node.
The race window is very small, and we return EAGAIN in this case
to indicate what has happened. The proper solution would be to move
the lookup parts of rename from the vfs into library calls which
the fs could call directly, but that is potentially a very big job
and this fix should cover most cases for now.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4667a0ec 18-Apr-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Make writeback more responsive to system conditions

This patch adds writeback_control to writing back the AIL
list. This means that we can then take advantage of the
information we get in ->write_inode() in order to set off
some pre-emptive writeback.

In addition, the AIL code is cleaned up a bit to make it
a bit simpler to understand.

There is still more which can usefully be done in this area,
but this is a good start at least.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f42ab085 14-Apr-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Optimise glock lru and end of life inodes

The GLF_LRU flag introduced in the previous patch can be
used to check if a glock is on the lru list when a new
holder is queued and if so remove it, without having first
to get the lru_lock.

The main purpose of this patch however is to optimise the
glocks left over when an inode at end of life is being
evicted. Previously such glocks were left with the GLF_LFLUSH
flag set, so that when reclaimed, each one required a log flush.
This patch resets the GLF_LFLUSH flag when there is nothing
left to flush thus preventing later log flushes as glocks are
reused or demoted.

In order to do this, we need to keep track of the number of
revokes which are outstanding, and also to clear the GLF_LFLUSH
bit after a log commit when only revokes have been processed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 44ad37d6 17-Mar-2011 Bob Peterson <rpeterso@redhat.com>

GFS2: filesystem hang caused by incorrect lock order

This patch fixes a deadlock in GFS2 where two processes are trying
to reclaim an unlinked dinode:
One holds the inode glock and calls gfs2_lookup_by_inum trying to look
up the inode, which it can't, due to I_FREEING. The other has set
I_FREEING from vfs and is at the beginning of gfs2_delete_inode
waiting for the glock, which is held by the first. The solution is to
add a new non_block parameter to the gfs2_iget function that causes it
to return -ENOENT if the inode is being freed.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2a7dba39 01-Feb-2011 Eric Paris <eparis@redhat.com>

fs/vfs/security: pass last path component to LSM on inode creation

SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.

This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.

Signed-off-by: Eric Paris <eparis@redhat.com>


# 24d9765f 18-Jan-2011 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix error path in gfs2_lookup_by_inum()

In the (impossible, except if there is fs corruption) error path
in gfs2_lookup_by_inum() if the call to gfs2_inode_refresh()
fails, it was leaving the function by calling iput() rather
than iget_failed(). This would cause future lookups of the same
inode to block forever.

This patch fixes the problem by moving the call to gfs2_inode_refresh()
into gfs2_inode_lookup() where iget_failed() is part of the error path
already. Also this cleans up some unreachable code and makes
gfs2_set_iop() static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b74c79e9 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 9e55cd53 09-Nov-2010 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove unreachable calls to vmtruncate

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 044b9414 03-Nov-2010 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix inode deallocation race

This area of the code has always been a bit delicate due to the
subtleties of lock ordering. The problem is that for "normal"
alloc/dealloc, we always grab the inode locks first and the rgrp lock
later.

In order to ensure no races in looking up the unlinked, but still
allocated inodes, we need to hold the rgrp lock when we do the lookup,
which means that we can't take the inode glock.

The solution is to borrow the technique already used by NFS to solve
what is essentially the same problem (given an inode number, look up
the inode carefully, checking that it really is in the expected
state).

We cannot do that directly from the allocation code (lock ordering
again) so we give the job to the pre-existing delete workqueue and
carry on with the allocation as normal.

If we find there is no space, we do a journal flush (required anyway
if space from a deallocation is to be released) which should block
against the pending deallocations, so we should always get the space
back.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a2e0f799 11-Aug-2010 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove i_disksize

With the update of the truncate code, ip->i_disksize and
inode->i_size are merely copies of each other. This means
we can remove ip->i_disksize and use inode->i_size exclusively
reducing the size of a GFS2 inode by 8 bytes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a4ffdde6 02-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

simplify checks for I_CLEAR/I_FREEING

add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information. As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1025774c 04-Jun-2010 Christoph Hellwig <hch@lst.de>

remove inode_setattr

Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b1becbde 17-Jun-2010 Bob Peterson <rpeterso@redhat.com>

GFS2: Fix kernel NULL pointer dereference by dlm_astd

This patch fixes a problem in an error path when looking
up dinodes. There are two sister-functions, gfs2_inode_lookup
and gfs2_process_unlinked_inode. Both functions acquire and
hold the i_iopen glock for the dinode being looked up. The last
thing they try to do is hold the i_gl glock for the dinode.
If that glock fails for some reason, the error path was
incorrectly calling gfs2_glock_put for the i_iopen glock twice.
This resulted in the glock being prematurely freed. The
"minimum hold time" usually kept the glock in memory, but the
lock interface to dlm (aka lock_dlm) freed its memory for the
glock. In some circumstances, it would cause dlm's dlm_astd daemon
to try to call the bast function for the freed lock_dlm memory,
which resulted in a NULL pointer dereference.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ed4878e8 20-May-2010 Bob Peterson <rpeterso@redhat.com>

GFS2: Rework reclaiming unlinked dinodes

The previous patch I wrote for reclaiming unlinked dinodes
had some shortcomings and did not prevent all hangs.
This version is much cleaner and more logical, and has
passed very difficult testing. Sorry for the churn.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 1a0eae88 14-Apr-2010 Bob Peterson <rpeterso@redhat.com>

GFS2: glock livelock

This patch fixes a couple gfs2 problems with the reclaiming of
unlinked dinodes. First, there were a couple of livelocks where
everything would come to a halt waiting for a glock that was
seemingly held by a process that no longer existed. In fact, the
process did exist, it just had the wrong pid number in the holder
information. Second, there was a lock ordering problem between
inode locking and glock locking. Third, glock/inode contention
could sometimes cause inodes to be improperly marked invalid by
iget_failed.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>


# 009d8518 07-Dec-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Metadata address space clean up

Since the start of GFS2, an "extra" inode has been used to store
the metadata belonging to each inode. The only reason for using
this inode was to have an extra address space, the other fields
were unused. This means that the memory usage was rather inefficient.

The reason for keeping each inode's metadata in a separate address
space is that when glocks are requested on remote nodes, we need to
be able to efficiently locate the data and metadata which relating
to that glock (inode) in order to sync or sync and invalidate it
(depending on the remotely requested lock mode).

This patch adds a new type of glock, which has in addition to
its normal fields, has an address space. This applies to all
inode and rgrp glocks (but to no other glock types which remain
as before). As a result, we no longer need to have the second
inode.

This results in three major improvements:
1. A saving of approx 25% of memory used in caching inodes
2. A removal of the circular dependency between inodes and glocks
3. No confusion between "normal" and "metadata" inodes in super.c

Although the first of these is the more immediately apparent, the
second is just as important as it now enables a number of clean
ups at umount time. Those will be the subject of future patches.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# eaff8079 17-Dec-2009 Christoph Hellwig <hch@lst.de>

kill I_LOCK

After I_SYNC was split from I_LOCK the leftover is always used together with
I_NEW and thus superflous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 431547b3 13-Nov-2009 Christoph Hellwig <hch@lst.de>

sanitize xattr handler prototypes

Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods. This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute. With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.

Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.

[with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0ab7d13f 06-Nov-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Tag all metadata with jid

There are two spare field in the header common to all GFS2
metadata. One is just the right size to fit a journal id
in it, and this patch updates the journal code so that each
time a metadata block is modified, we tag it with the journal
id of the node which is performing the modification.

The reason for this is that it should make it much easier to
debug issues which arise if we can tell which node was the
last to modify a particular metadata block.

Since the field is updated before the block is written into
the journal, each journal should only contain metadata which
is tagged with its own journal id. The one exception to this
is the journal header block, which might have a different node's
id in it, if that journal was recovered by another node in the
cluster.

Thus each journal will contain a record of which nodes recovered
it, via the journal header.

The other field in the metadata header could potentially be
used to hold information about what kind of operation was
performed, but for the time being we just zero it on each
transaction so that if we use it for that in future, we'll
know that the information (where it exists) is reliable.

I did consider using the other field to hold the journal
sequence number, however since in GFS2's journaling we write
the modified data into the journal and not the original
data, this gives no information as to what action caused the
modification, so I think we can probably come up with a better
use for those 64 bits in the future.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 479c427d 01-Oct-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up ACLs

To prepare for support for caching of ACLs, this cleans up the GFS2
ACL support by pushing the xattr code back into xattr.c and changing
the acl_get function into one which only returns ACLs so that we
can drop the caching function into it shortly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 8d8291ae 27-Aug-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Remove no_formal_ino generating code

The inum structure used throughout GFS2 has two fields. One
no_addr is the disk block number of the inode in question and
is used everywhere as the inode number. The other, no_formal_ino,
is used only as the generation number for NFS.

Historically the no_formal_ino field was set using a complicated
system of one global and one per-node file containing inode numbers
in order to ensure that each no_formal_ino was unique. Also this
code made no provision for what would happen when eventually the
(64 bit) numbers ran out. Now I know that is pretty unlikely to
happen given the large space of numbers, but it is possible
nevertheless.

The only guarantee required for no_formal_ino is that, for any
single inode, the same number doesn't get reused too quickly.

We already have a generation number which is kept in the inode
and initialised from a counter in the resource group (almost
no overhead, since we have to touch the resource group anyway
in order to allocate an inode in the first place). Aside from
ensuring that we never use the value 0 in the no_formal_ino
field, we can use that counter directly.

As a result of that change, we lose about 200 lines of code and
also gain about 10 creates/sec on the postmark benchmark (on
my test machine).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 307cf6e6 26-Aug-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Rename eattr.[ch] as xattr.[ch]

Use the more conventional name for the extended attribute
support code. Update all the places which care.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 40b78a32 26-Aug-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up of extended attribute support

This has been on my list for some time. We need to change the way
in which we handle extended attributes to allow faster file creation
times (by reducing the number of transactions required) and the
extended attribute code is the main obstacle to this.

In addition to that, the VFS provides a way to demultiplex the xattr
calls which we ought to be using, rather than rolling our own. This
patch changes the GFS2 code to use that VFS feature and as a result
the code shrinks by a couple of hundred lines or so, and becomes
easier to read.

I'm planning on doing further clean up work in this area, but this
patch is a good start. The cleaned up code also uses the more usual
"xattr" shorthand, I plan to eliminate the use of "eattr" eventually
and in the mean time it serves as a flag as to which bits of the code
have been updated.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6050b9c7 31-Jul-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Improve error handling in inode allocation

A little while back, block allocation was given some improved
error handling which meant that -EIO was returned in the case
of there being a problem in the resource group data. In addition
a message is printed explaning what went wrong and how to fix it.
This extends that error handling so that it also covers inode
allocation too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 87ec2174 22-May-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move gfs2_unlink_ok into ops_inode.c

Another function which is only called from one ops_inode.c so
we move it and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 536baf02 22-May-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move gfs2_readlinki into ops_inode.c

Move gfs2_readlinki into ops_inode.c and make it static

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2286dbfa 22-May-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move gfs2_rmdiri into ops_inode.c

Move gfs2_rmdiri() into ops_inode.c and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b1e71b06 22-May-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Clean up some file names

This patch renames the ops_*.c files which have no counterpart
without the ops_ prefix in order to shorten the name and make
it more readable. In addition, ops_address.h (which was very
small) is moved into inode.h and inode.h is cleaned up by
adding extern where required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 10d21988 07-Apr-2009 Christoph Hellwig <hch@lst.de>

GFS2: cleanup file_operations mess

Remove the weird pointer to file_operations mess and replace it with
straight-forward defining of the lockinginstance names to the _nolock
variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f057f6cd 12-Jan-2009 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Merge lock_dlm module into GFS2

This is the big patch that I've been working on for some time
now. There are many reasons for wanting to make this change
such as:
o Reducing overhead by eliminating duplicated fields between structures
o Simplifcation of the code (reduces the code size by a fair bit)
o The locking interface is now the DLM interface itself as proposed
some time ago.
o Fewer lookups of glocks when processing replies from the DLM
o Fewer memory allocations/deallocations for each glock
o Scope to do further optimisations in the future (but this patch is
more than big enough for now!)

Please note that (a) this patch relates to the lock_dlm module and
not the DLM itself, that is still a separate module; and (b) that
we retain the ability to build GFS2 as a standalone single node
filesystem with out requiring the DLM.

This patch needs a lot of testing, hence my keeping it I restarted
my -git tree after the last merge window. That way, this has the maximum
exposure before its merged. This is (modulo a few minor bug fixes) the
same patch that I've been posting on and off the the last three months
and its passed a number of different tests so far.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 97cc1025 20-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Kill two daemons with one patch

This patch removes the two daemons, gfs2_scand and gfs2_glockd
and replaces them with a shrinker which is called from the VM.

The net result is that GFS2 responds better when there is memory
pressure, since it shrinks the glock cache at the same rate
as the VFS shrinks the dcache and icache. There are no longer
any time based criteria for shrinking glocks, they are kept
until such time as the VM asks for more memory and then we
demote just as many glocks as required.

There are potential future changes to this code, including the
possibility of sorting the glocks which are to be written back
into inode number order, to get a better I/O ordering. It would
be very useful to have an elevator based workqueue implementation
for this, as that would automatically deal with the read I/O cases
at the same time.

This patch is my answer to Andrew Morton's remark, made during
the initial review of GFS2, asking why GFS2 needs so many kernel
threads, the answer being that it doesn't :-) This patch is a
net loss of about 200 lines of code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 383f01fb 04-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Banish struct gfs2_dinode_host

The final field in gfs2_dinode_host was the i_flags field. Thats
renamed to i_diskflags in order to avoid confusion with the existing
inode flags, and moved into the inode proper at a suitable location
to avoid creating a "hole".

At that point struct gfs2_dinode_host is no longer needed and as
promised (quite some time ago!) it can now be removed completely.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c9e98886 04-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize

This patch moved the i_size field from the gfs2_dinode_host and
following the ext3 convention renames it i_disksize.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3767ac21 03-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move di_eattr into "proper" inode

This moves the di_eattr field out of gfs2_inode_host and
into the inode proper.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ad6203f2 03-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move "entries" into "proper" inode

This moves the directory entry count into the proper inode.
Potentially we could get this to share the space used by
something else in the future, but this is one more step
on the way to removing the gfs2_dinode_host structure.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bcf0b5b3 03-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Move generation number into "proper" part of inode

This moves the generation number from the gfs2_dinode_host
into the gfs2_inode structure. Eventually the plan is to get
rid of the gfs2_dinode_host structure completely.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b2760583 14-Oct-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Rationalise header files

Move the contents of some headers which contained very
little into more sensible places, and remove the original
header files. This should make it easier to find things.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3de7be33 13-Nov-2008 David Howells <dhowells@redhat.com>

CRED: Wrap task credential accesses in the GFS2 filesystem

Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: James Morris <jmorris@namei.org>


# 719ee344 18-Sep-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: high time to take some time over atime

Until now, we've used the same scheme as GFS1 for atime. This has failed
since atime is a per vfsmnt flag, not a per fs flag and as such the
"noatime" flag was not getting passed down to the filesystems. This
patch removes all the "special casing" around atime updates and we
simply use the VFS's atime code.

The net result is that GFS2 will now support all the same atime related
mount options of any other filesystem on a per-vfsmnt basis. We do lose
the "lazy atime" updates, but we gain "relatime". We could add lazy
atime to the VFS at a later date, if there is a requirement for that
variant still - I suspect relatime will be enough.

Also we lose about 100 lines of code after this patch has been applied,
and I have a suspicion that it will speed things up a bit, even when
atime is "on". So it seems like a nice clean up as well.

From a user perspective, everything stays the same except the loss of
the per-fs atime quantum tweekable (ought to be per-vfsmnt at the very
least, and to be honest I don't think anybody ever used it) and that a
number of options which were ignored before now work correctly.

Please let me know if you've got any comments. I'm pushing this out
early so that you can all see what my plans are.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bd1eb881 01-Sep-2008 Julien Brunel <brunel@diku.dk>

GFS2: Use an IS_ERR test rather than a NULL test

In case of error, the function gfs2_inode_lookup returns an
ERR pointer, but never returns a NULL pointer. So a NULL test that
necessarily comes after an IS_ERR test should be deleted, and a NULL
test that may come after a call to this function should be
strengthened by an IS_ERR test.

The semantic match that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@match_bad_null_test@
expression x, E;
statement S1,S2;
@@
x = gfs2_inode_lookup(...)
... when != x = E
* if (x != NULL)
S1 else S2
// </smpl>

Signed-off-by: Julien Brunel <brunel@diku.dk>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 0188d6c5 26-Aug-2008 Steven Whitehouse <swhiteho@redhat.com>

GFS2: Fix & clean up GFS2 rename

This patch fixes a locking issue in the rename code by ensuring that we hold
the per sb rename lock over both directory and "other" renames which involve
different parent directories.

At the same time, this moved the (only called from one place) function
gfs2_ok_to_move into the file that its called from, so we can mark it
static. This should make a code a bit easier to follow.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Peter Staubach <staubach@redhat.com>


# a569c711 23-Jul-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] don't pass nameidata to gfs2_lookupi()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c9f6a6bb 10-Jul-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove support for unused and pointless flag

The ability to mark files for direct i/o access when opened
normally is both unused and pointless, so this patch removes
support for that feature.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f58ba889 02-Jul-2008 Miklos Szeredi <miklos@szeredi.hu>

[GFS2] don't call permission()

GFS2 calls permission() to verify permissions after locks on the files
have been taken.

For this it's sufficient to call gfs2_permission() instead. This
results in the following changes:

- IS_RDONLY() check is not performed
- IS_IMMUTABLE() check is not performed
- devcgroup_inode_permission() is not called
- security_inode_permission() is not called

IS_RDONLY() should be unnecessary anyway, as the per-mount read-only
flag should provide protection against read-only remounts during
operations. do_gfs2_set_flags() has been fixed to perform
mnt_want_write()/mnt_drop_write() to protect against remounting
read-only.

IS_IMMUTABLE has been added to gfs2_permission()

Repeating the security checks seems to be pointless, as they don't
normally change, and if they do, it's independent of the filesystem
state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 091806ed 28-Apr-2008 Bob Peterson <rpeterso@redhat.com>

[GFS2] filesystem consistency error from do_strip

This patch fixes a GFS2 filesystem consistency error reported from
function do_strip. The problem was caused by a timing window
that allowed two vfs inodes to be created in memory that point
to the same file. The problem is fixed by making the vfs's
iget_test, iget_set mechanism check and set a new bit in the
in-core gfs2_inode structure while the vfs inode spin_lock is held.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 16c5f06f 09-Apr-2008 Josef Bacik <jbacik@redhat.com>

[GFS2] fix GFP_KERNEL misuses

There are several places where GFP_KERNEL allocations happen under a glock,
which will result in hangs if we're under memory pressure and go to re-enter the
fs in order to flush stuff out. This patch changes the culprits to GFS_NOFS to
keep this problem from happening. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 182fe5ab 03-Mar-2008 Cyrill Gorcunov <gorcunov@gmail.com>

[GFS2] possible null pointer dereference fixup

gfs2_alloc_get may fail so we have to check it to prevent
NULL pointer dereference.

Signed-off-by: Cyrill Gorcunov <gorcunov@gamil.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 43a33c53 26-Feb-2008 Denis Cheng <crquan@gmail.com>

[GFS2] re-support special inode

a previous commit removed call to
init_special_inode from inode lookuping, this cause problems as:

# mknod /mnt/gfs2/dev/null c 1 3
# cat /mnt/gfs2/dev/null
cat: /mnt/gfs2/dev/null: Invalid argument

without special inode, GFS2 cannot support char device file,
block device file, fifo pipe, and socket file, lose many important
features as a common file system.

this one line patch re add special inode support.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d83225d4 26-Feb-2008 Denis Cheng <crquan@gmail.com>

[GFS2] remove gfs2_dev_iops

struct inode_operations gfs2_dev_iops is always the same as gfs2_file_iops,
since Jan 2006, when GFS2 merged into mainstream kernel.

So one of them could be removed.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 7afd88d9 22-Feb-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix a page lock / glock deadlock

We've previously been using a "try lock" in readpage on the basis that
it would prevent deadlocks due to the inverted lock ordering (our normal
lock ordering is glock first and then page lock). Unfortunately tests
have shown that this isn't enough. If the glock has a demote request
queued such that run_queue() in the glock code tries to do a demote when
its called under readpage then it will try and write out all the dirty
pages which requires locking them. This then deadlocks with the page
locked by readpage.

The solution is to always require two calls into readpage. The first
unlocks the page, gets the glock and returns AOP_TRUNCATED_PAGE, the
second does the actual readpage and unlocks the glock & page as
required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 77658aad 12-Feb-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Eliminate (almost) duplicate field from gfs2_inode

The blocks counter is almost a duplicate of the i_blocks
field in the VFS inode. The only difference is that i_blocks
can be only 32bits long for 32bit arch without large single file
support. Since GFS2 doesn't handle the non-large single file
case (for 32 bit anyway) this adds a new config dependency on
64BIT || LSF. This has always been the case, however we've never
explicitly said so before.

Even if we do add support for the non-LSF case, we will still
not require this field to be duplicated since we will not be
able to access oversized files anyway.

So the net result of all this is that we shave 8 bytes from a gfs2_inode
and get our config deps correct.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ce276b06 06-Feb-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Reduce inode size by merging fields

There were three fields being used to keep track of the location
of the most recently allocated block for each inode. These have
been merged into a single field in order to better keep the
data and metadata for an inode close on disk, and also to reduce
the space required for storage.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9a004508 01-Feb-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink & rename di_depth

This patch forms a pair with the previous patch which shrunk
di_height. Like that patch di_depth is renamed i_depth and moved
into struct gfs2_inode directly. Also the field goes from 16 bits
to 8 bits since it is also limited to a max value which is rather
small (17 in this case). In addition we also now validate the field
against this maximum value when its read in.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ca390601 28-Jan-2008 Bob Peterson <rpeterso@redhat.com>

[GFS2] Fix debug inode printing

I noticed that the latest change to i_height got rid of the
value from the inode dump. This patch adds it back.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ecc30c79 28-Jan-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Streamline indirect pointer tree height calculation

This patch improves the calculation of the tree height in order to reduce
the number of operations which are carried out on each call to gfs2_block_map.
In the common case, we now make a single comparison, rather than calculating
the required tree height from scratch each time. Also in the case that the
tree does need some extra height, we start from the current height rather from
zero when we work out what the new height ought to be.

In addition the di_height field is moved into the inode proper and reduced
in size to a u8 since the value must be between 0 and GFS2_MAX_META_HEIGHT (10).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 69840b0d 07-Feb-2008 David Howells <dhowells@redhat.com>

iget: use iget_failed() in GFS2

Use iget_failed() in GFS2 to kill a failed inode.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1b8177ec 19-Jan-2008 Bob Peterson <rpeterso@redhat.com>

[GFS2] Lockup on error

I spotted this bug while I was digging around. Looks like it could cause
a lockup in some rare error condition.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6dbd8224 10-Jan-2008 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Reduce inode size by moving i_alloc out of line

It is possible to reduce the size of GFS2 inodes by taking the i_alloc
structure out of the gfs2_inode. This patch allocates the i_alloc
structure whenever its needed, and frees it afterward. This decreases
the amount of low memory we use at the expense of requiring a memory
allocation for each page or partial page that we write. A quick test
with postmark shows that the overhead is not measurable and I also note
that OCFS2 use the same approach.

In the future I'd like to solve the problem by shrinking down the size
of the members of the i_alloc structure, but for now, this reduces the
immediate problem of using too much low-memory on x86 and doesn't add
too much overhead.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c97bfe43 29-Nov-2007 Wendy Cheng <wcheng@redhat.com>

[GFS2] Remove lock methods for lock_nolock protocol

GFS2 supports two modes of locking - lock_nolock for single node filesystem
and lock_dlm for cluster mode locking. The gfs2 lock methods are removed from
file operation table for lock_nolock protocol. This would allow VFS to handle
posix lock and flock logics just like other in-tree filesystems without
duplication.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2bcd610d 08-Nov-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Don't add glocks to the journal

The only reason for adding glocks to the journal was to keep track
of which locks required a log flush prior to release. We add a
flag to the glock to allow this check to be made in a simpler way.

This reduces the size of a glock (by 12 bytes on i386, 24 on x86_64)
and means that we can avoid extra work during the journal flush.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 5561093e 17-Oct-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Introduce gfs2_set_aops()

Just like ext3 we now have three sets of address space operations
to cover the cases of writeback, ordered and journalled data
writes. This means that the individual operations can now become
less complicated as we are able to remove some of the tests for
file data mode from the code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f91a0d3e 15-Oct-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove useless i_cache from inodes

The i_cache was designed to keep references to the indirect blocks
used during block mapping so that they didn't have to be looked
up continually. The idea failed because there are too many places
where the i_cache needs to be freed, and this has in the past been
the cause of many bugs.

In addition there was no performance benefit being gained since the
disk blocks in question were cached anyway. So this patch removes
it in order to simplify the code to prepare for other changes which
would otherwise have had to add further support for this feature.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 51ff87bd 15-Oct-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Clean up internal read function

As requested by Christoph, this patch cleans up GFS2's internal
read function so that it no longer uses the do_generic_mapping_read
function. This function is obsolete and GFS2 is the last user of it.

As a side effect the internal read code gets smaller and easier
to read and gfs2_readpage is split into two. One function has the locking
and the other function has the rest of the logic.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>


# 7a9f53b3 18-Sep-2007 Benjamin Marzinski <bmarzins@redhat.com>

[GFS2] Alternate gfs2_iget to avoid looking up inodes being freed

There is a possible deadlock between two processes on the same node, where one
process is deleting an inode, and another process is looking for allocated but
unused inodes to delete in order to create more space.

process A does an iput() on inode X, and it's i_count drops to 0. This causes
iput_final() to be called, which puts an inode into state I_FREEING at
generic_delete_inode(). There no point between when iput_final() is called, and
when I_FREEING is set where GFS2 could acquire any glocks. Once I_FREEING is
set, no other process on that node can successfully look up that inode until
the delete finishes.

process B locks the the resource group for the same inode in get_local_rgrp(),
which is called by gfs2_inplace_reserve_i()

process A tries to lock the resource group for the inode in
gfs2_dinode_dealloc(), but it's already locked by process B

process B waits in find_inode for the inode to have the I_FREEING state cleared.

Deadlock.

This patch solves the problem by adding an alternative to gfs2_iget(),
gfs2_iget_skip(), that simply skips any inodes that are in the I_FREEING
state.o The alternate test function is just like the original one, except that
it fails if the inode is being freed, and sets a skipped flag. The alternate
set function is just like the original, except that it fails if the skipped
flag is set. Only try_rgrp_unlink() calls gfs2_iget_skip() instead of
gfs2_iget().

Signed-off-by: Benjamin E. Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e9bd2b3b 24-Aug-2007 Wendy Cheng <wcheng@redhat.com>

[GFS2] fix inode meta data corruption

Fix a nasty inode meta data corruption issue by keeping the buffer head in
icache array. This buffer needs to stay in memory until journal flush occurs
Otherwise, gfs2_meta_inode_buffer could do a disk read before the inode hits
disk. It ends up with meta data corruptions. The buffer will be released as
part of the existing journal flush logic.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 35dcc52e 27-Jun-2007 Wendy Cheng <wcheng@redhat.com>

[GFS2] Remove i_mode passing from NFS File Handle

GFS2 has been passing i_mode within NFS File Handle. Other than the
wrong assumption that there is always room for this extra 16 bit value,
the current gfs2_get_dentry doesn't really need the i_mode to work
correctly. Note that GFS2 NFS code does go thru the same lookup code
path as direct file access route (where the mode is obtained from name
lookup) but gfs2_get_dentry() is coded for different purpose. It is not
used during lookup time. It is part of the file access procedure call.
When the call is invoked, if on-disk inode is not in-memory, it has to
be read-in. This makes i_mode passing a useless overhead.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bb9bcf06 27-Jun-2007 Wendy Cheng <wcheng@redhat.com>

[GFS2] Obtaining no_formal_ino from directory entry

GFS2 lookup code doesn't ask for inode shared glock. This implies during
in-memory inode creation for existing file, GFS2 will not disk-read in
the inode contents. This leaves no_formal_ino un-initialized during
lookup time. The un-initialized no_formal_ino is subsequently encoded
into file handle. Clients will get ESTALE error whenever it tries to
access these files.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d93cfa98 11-Jun-2007 Abhijith Das <adas@redhat.com>

[GFS2] Fix deallocation issues

There were two issues during deallocation of unlinked inodes. The
first was relating to the use of a "try" lock which in the case of
the inode lock wasn't trying hard enough to deallocate in all
circumstances (now changed to a normal glock) and in the case of
the iopen lock didn't wait for the demotion of the shared lock before
attempting to get the exclusive lock, and thereby sometimes (timing dependent)
not completing the deallocation when it should have done.

The second issue related to the lack of a way to invalidate dcache entries
on remote nodes (now fixed by this patch) which meant that unlinks were
taking a long time to return disk space to the fs. By adding some code to
invalidate the dcache entries across the cluster for unlinked inodes, that
is now fixed.

This patch was written jointly by Abhijith Das and Steven Whitehouse.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 037bcbb7 08-Jun-2007 Andrew Morton <akpm@linux-foundation.org>

[GFS2] gfs2_lookupi() uninitialised var fix

fs/gfs2/inode.c: In function 'gfs2_lookupi':
fs/gfs2/inode.c:392: warning: 'error' may be used uninitialized in this function

Looks like a real bug to me.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c8cdf479 08-Jun-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Recovery for lost unlinked inodes

Under certain circumstances its possible (though rather unlikely) that
inodes which were unlinked by one node while still open on another might
get "lost" in the sense that they don't get deallocated if the node
which held the inode open crashed before it was unlinked.

This patch adds the recovery code which allows automatic deallocation of
the inode if its found during block allocation (the sensible time to
look for such inodes since we are scanning the rgrp's bitmaps anyway at
this time, so it adds no overhead to do this).

Since the inode will have had its i_nlink set to zero, all we need to
trigger recovery is a lookup and an iput(), and the normal deallocation
code takes care of the rest.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e1cc8603 07-Jun-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix bug in error path of inode

This fixes a bug in the ordering of operations in the error path of
createi. Its not valid to do an iput() when holding the inode's glock
since the iput() will (in this case) result in delete_inode() being
called which needs to grab the lock itself. This was causing the
recursive lock checking code to trigger.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4bd91ba1 05-Jun-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Add nanosecond timestamp feature

This adds a nanosecond timestamp feature to the GFS2 filesystem. Due
to the way that the on-disk format works, older filesystems will just
appear to have this field set to zero. When mounted by an older version
of GFS2, the filesystem will simply ignore the extra fields so that
it will again appear to have whole second resolution, so that its
trivially backward compatible.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bb8d8a6f 01-Jun-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix sign problem in quota/statfs and cleanup _host structures

This patch fixes some sign issues which were accidentally introduced
into the quota & statfs code during the endianess annotation process.
Also included is a general clean up which moves all of the _host
structures out of gfs2_ondisk.h (where they should not have been to
start with) and into the places where they are actually used (often only
one place). Also those _host structures which are not required any more
are removed entirely (which is the eventual plan for all of them).

The conversion routines from ondisk.c are also moved into the places
where they are actually used, which for almost every one, was just one
single place, so all those are now static functions. This also cleans up
the end of gfs2_ondisk.h which no longer needs the #ifdef __KERNEL__.

The net result is a reduction of about 100 lines of code, many functions
now marked static plus the bug fixes as mentioned above. For good
measure I ran the code through sparse after making these changes to
check that there are no warnings generated.

This fixes Red Hat bz #239686

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# dbb7cae2 15-May-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Clean up inode number handling

This patch cleans up the inode number handling code. The main difference
is that instead of looking up the inodes using a struct gfs2_inum_host
we now use just the no_addr member of this structure. The tests relating
to no_formal_ino can then be done by the calling code. This has
advantages in that we want to do different things in different code
paths if the no_formal_ino doesn't match. In the NFS patch we want to
return -ESTALE, but in the ->lookup() path, its a bug in the fs if the
no_formal_ino doesn't match and thus we can withdraw in this case.

In order to later fix bz #201012, we need to be able to look up an inode
without knowing no_formal_ino, as the only information that is known to
us is the on-disk location of the inode in question.

This patch will also help us to fix bz #236099 at a later date by
cleaning up a lot of the code in that area.

There are no user visible changes as a result of this patch and there
are no changes to the on-disk format either.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 1be38679 01-Mar-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix bz 229831, lookup returns wrong inode

The following patch fixes Red Hat bz 229831. Without this patch its
possible for the wrong inode to be returned in certain cases. It is a
pretty unusual event, so that its taken some time to track down. Thanks
and due to Josef Whiter who did a lot of the testing required to thrack
this down and fix it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# fb0d3bce 28-Feb-2007 Wendy Cheng <wcheng@redhat.com>

[GFS2] pass formal ino in do_filldir_main

ok, the following is the minimum changes to get NFSD going before we
settle down this issue .. would appreciate this in the tree so other NFS
related works can get done in parallel.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ddee7608 29-Jan-2007 Russell Cattelan <cattelan@redhat.com>

[GFS2] Fix unlink deadlocks

Move the glock acquisition to outside of the transactions.

Lock odering must be preserved in order to prevent ABBA
deadlocks. The current gfs2_change_nlink code would tries
to grab the glock after having started a transaction and thus is holding
the log lock. This is inconsistent with other code paths in
gfs that grab the resource group glock prior to staring
a tranactions.

One problem with this fix is that the resource group
lock is always grabbed now even if the inode still has
ref count and can not be marked for unlink.

Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d7c103d0 25-Jan-2007 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix recursive locking attempt with NFS

In certain cases, its possible for NFS to call the lookup code while
holding the glock (when doing a readdirplus operation) so we need to
check for that and not try and lock the glock twice. This also fixes a
typo in a previous NFS related GFS2 patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ddfe0627 18-Jan-2007 Eric Sandeen <sandeen@redhat.com>

[GFS2] use CURRENT_TIME_SEC instead of get_seconds in gfs2

I was looking something else up and came across this...

I don't honestly have a good reason to change it other than to make it
like every other Linux filesystem in this regard. ;-) It doesn't
functionally change anything, but makes some lines shorter. :)

I'm also curious; why does gfs2 have 64-bits of on-disk timestamps, but
not in timespec_t format, and only stores second resolutions? Seems like
you're halfway to sub-second resolutions already.

I suppose if that gets implemented then all of the below should
instead be CURRENT_TIME not CURRENT_TIME_SEC.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 03dc6a53 13-Jan-2007 Adrian Bunk <bunk@stusta.de>

[GFS2] make gfs2_change_nlink_i() static

On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc3-mm1:
>...
> git-gfs2-nmw.patch
>...
> git trees
>...

This patch makes the needlessly globlal gfs2_change_nlink_i() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 87d21e07 18-Jan-2007 S. Wendy Cheng <wcheng@redhat.com>

[GFS2] Fix gfs2_rename deadlock

Second round of gfs2_rename lock re-ordering to allow Anaconda adding
root partition on top of gfs2. Previous to this patch the recursive
lock detector in glock.c can be triggered due to attempting to lock
the rgrp twice. This fixes it by checking to see whether the rgrp
is already locked.

This fixes Red Hat bugzilla #221237

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6c93fd1e 08-Jan-2007 Russell Cattelan <cattelan@redhat.com>

[GFS2] BZ 217008 fsfuzzer fix.

Update the quilt header comments to match the
code changes.

Change gfs2_lookup_simple to return an error in the case
of a NULL inode.
The callers of gfs2_lookup_simple do not check for NULL
in the no entry case and such would end up dereferencing a NULL ptr.

This fixes:
http://projects.info-pull.com/mokb/MOKB-15-11-2006.html

Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 5509826f 18-Jan-2007 S. Wendy Cheng <wcheng@redhat.com>

[GFS2] Fix change nlink deadlock

Bugzilla 215088

Fix deadlock in gfs2_change_nlink() while installing RHEL5 into GFS2
partition. The gfs2_rename() apparently needs block allocation for the
new name (into the directory) where it requires rg locks. At the same
time, while updating the nlink count for the replaced file,
gfs2_change_nlink() tries to return the inode meta-data back to resource
group where it needs rg locks too. Our logic doesn't allow process to
acquire these locks recursively by the same process (RHEL installer)
that results a BUG call. This only happens within rename code path and
only if the destination file exists before the rename operation.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 28626e20 22-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix glock ordering on inode creation

The lock order here should be parent -> child rather than
numeric order.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# dcd24799 16-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove unused function from inode.c

The gfs2_glock_nq_m_atime function is unused in so far as its only
ever called with num_gh = 1, and this falls through to the
gfs2_glock_nq_atime function, so we might as well call that directly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9e2dbdac 08-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove gfs2_inode_attr_in

This function wasn't really doing the right thing. There was no need
to update the inode size at this point and the updating of the
i_blocks field has now been moved to the places where di_blocks is
updated. A result of this patch and some those preceeding it is that
unlocking a glock is now a much more efficient process, since there
is no longer any requirement to copy data from the gfs2 inode into
the vfs inode at this point.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e7c698d7 08-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Inode number is constant

Since the inode number is constant, we don't need to keep updating
it everytime we refresh the other inode fields.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 6b124d8d 07-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Only set inode flags when required

We were setting the inode flags from GFS2's flags far too often, even when they
couldn't possibly have changed. This patch reduces the amount of flag
setting going on so that we do it only when the inode is read in or
when the flags have changed. The create case is covered by the "when
the inode is read in" case.

This also fixes a bug where we didn't set S_SYNC correctly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 294caaa3 02-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Tidy up 0 initialisations in inode.c

We don't need to use endian conversions for 0 initialisations
when creating a new on-disk inode.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bfded27b 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (8) - i_vn

This shrinks the size of the gfs2_inode by 8 bytes by
replacing the version counter with a one bit valid/invalid
flag.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# a9583c79 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (7) - di_payload_format

This is almost never used. Its there for backward
compatibility with GFS1. It doesn't need its own
field since it can always be calculated from the
inode mode & flags. This saves a bit more space
in the gfs2_inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 1a7b1eed 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (6) - di_atime/di_mtime/di_ctime

Remove the di_[amc]time fields and use inode->i_[amc]time
fields instead. This saves 24 bytes from the gfs2_inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4f56110a 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (5) - di_nlink

Remove the di_nlink field in favour of inode->i_nlink and
update the nlink handling to use the proper macros. This
saves 4 bytes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2933f925 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (4) - di_uid/di_gid

Remove duplicate di_uid/di_gid fields in favour of using
inode->i_uid/inode->i_gid instead. This saves 8 bytes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b60623c2 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (3) - di_mode

This removes the duplicate di_mode field in favour of using the
inode->i_mode field. This saves 4 bytes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e7f14f4d 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (2) - di_major/di_minor

This removes the device numbers from this structure by using
inode->i_rdev instead. It also cleans up the code in gfs2_mknod.
It results in shrinking the gfs2_inode by 8 bytes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# af339c02 01-Nov-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Shrink gfs2_inode (1) - di_header/di_num

The metadata header doesn't need to be stored in the incore
struct gfs2_inode since its constant, and this patch removes it.
Also, there is already a field for the inode's number in the
struct gfs2_inode, so we don't need one in struct gfs2_dinode_host
as well.

This saves 28 bytes of space in the struct gfs2_inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4cc14f0b 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Change argument to gfs2_dinode_print

Change argument for gfs2_dinode_print in order to prepare
for removal of duplicate fields between struct inode and
struct gfs2_dinode_host.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# ea744d01 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Move gfs2_dinode_in to inode.c

gfs2_dinode_in() is only ever called from one place, so move it
to that place (in inode.c) and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 891ea147 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Change argument to gfs2_dinode_in

This is a preliminary patch to enable the removal of fields
in gfs2_dinode_host which are duplicated in struct inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 539e5d6b 31-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Change argument of gfs2_dinode_out

Everywhere this was called, a struct gfs2_inode was available,
but despite that, it was always called with a struct gfs2_dinode
as an argument. By making this change it paves the way to start
eliminating fields duplicated between the kernel's struct inode
and the struct gfs2_dinode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b44b84d7 14-Oct-2006 Al Viro <viro@zeniv.linux.org.uk>

[GFS2] gfs2 misc endianness annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 629a21e7 13-Oct-2006 Al Viro <viro@zeniv.linux.org.uk>

[GFS2] split and annotate gfs2_inum

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e6972647 13-Oct-2006 Al Viro <viro@zeniv.linux.org.uk>

[GFS2] split and annotate gfs2_inum_range

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3ca68df6 13-Oct-2006 Al Viro <viro@zeniv.linux.org.uk>

[GFS2] split gfs2_dinode into on-disk and host variants

The latter is used as part of gfs2-private part of struct inode.
It actually stores a lot of fields differently; for now the
declaration is just cloned, inode field is swtiched and changes
propagated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 26d83ded 30-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix OOM error handling

Fix the OOM error handling in inode.c where it was possible for
a NULL pointer to be dereferenced.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# fcb47e0b 03-Oct-2006 Ryan O'Hara <rohara@redhat.com>

[GFS2] Initialize SELinux extended attributes at inode creation time.

This patch has gfs2_security_init declared as a static function, which
is correct. As a result, the declaration of this function in inode.h is
removed (and thus inode.h is unchanged). Also removed #include eaops.h,
which is not needed.

Signed-Off-By: Ryan O'Hara <rohara@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f92a0b6f 02-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Mark nlink cleared so VFS sees it happen

This does nothing atm, but will be required for later support of
r/o bind mounts.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 48516ced 01-Oct-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove uneeded endian conversion

In many places GFS2 was calling the endian conversion routines
for an inode even when only a single field, or a few fields might
have changed. As a result we were copying lots of data needlessly.

This patch replaces those calls with conversion of just the
required fields in each case. This should be faster and easier
to understand. There are still other places which suffer from this
problem, but this is a start in the right direction.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 825f9075 27-Sep-2006 Theodore Ts'o <tytso@mit.edu>

[GFS2] inode-diet: Eliminate i_blksize from the inode structure

This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>


# bba9dfd8 27-Sep-2006 Theodore Ts'o <tytso@mit.edu>

[GFS2] inode_diet: Replace inode.u.generic_ip with inode.i_private (gfs)

The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86. (It would be more on an x86_64 system). This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer. Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer. This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>


# 907b9bce 25-Sep-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2/DLM] Fix trailing whitespace

As per Andrew Morton's request, removed trailing whitespace.

Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 7d308590 18-Sep-2006 Fabio Massimo Di Nitto <fabbione@ubuntu.com>

[GFS2] Export lm_interface to kernel headers


lm_interface.h has a few out of the tree clients such as GFS1
and userland tools.

Right now, these clients keeps a copy of the file in their build tree
that can go out of sync.

Move lm_interface.h to include/linux, export it to userland and
clean up fs/gfs2 to use the new location.

Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c2668711 04-Sep-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove a cast, tidy gfs2_inode_attr_in

The remains of the changes for Jan Engelhardt's third email. Remove
a cast and tidy up gfs2_inode_attr_in.

Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# cd915493 03-Sep-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Change all types to uX style

This makes all fixed size types have consistent names.

Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 75d3b817 04-Sep-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Tidy up bmap/inode code

As per Jan Engelhardt's third set of comments, this make various
code style changes and moves the structures from format.h into
super.c, which was the only place that format.h was actually used.

Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e9fc2aa0 01-Sep-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Update copyright, tidy up incore.h

As per comments from Jan Engelhardt <jengelh@linux01.gwdg.de> this
updates the copyright message to say "version" in full rather than
"v.2". Also incore.h has been updated to remove forward structure
declarations which are not required.

The gfs2_quota_lvb structure has now had endianess annotations added
to it. Also quota.c has been updated so that we now store the
lvb data locally in endian independant format to avoid needing
a structure in host endianess too. As a result the endianess
conversions are done as required at various points and thus the
conversion routines in lvb.[ch] are no longer required. I've
moved the one remaining constant in lvb.h thats used into lm.h
and removed the unused lvb.[ch].

I have not changed the HIF_ constants. That is left to a later patch
which I hope will unify the gh_flags and gh_iflags fields of the
struct gfs2_holder.

Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 420b9e5e 31-Jul-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Tidy up in various files

Tidy up some files and remove an unused routine in meta_io.h. Also
added a bit of extra debugging in meta_io.h.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b2a580d8 09-Jul-2006 Abhijith Das <adas@redhat.com>

[PATCH] patch to init di_payload_format field in gfs2_dinode

A missing initialisation when creating a new on disk inode.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4340fe62 11-Jul-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Add generation number

This adds a generation number for the eventual use of NFS to the
ondisk inode. Its backward compatible with the current code since
it doesn't really matter what the generation number is to start with,
and indeed since its set to zero, due to it being taken from padding
in both the inode and rgrp header, it should be fine.

The eventual plan is to use this rather than no_formal_ino in the
NFS filehandles. At that point no_formal_ino will be unused.

At the same time we also add a releasepages call back to the
"normal" address space for gfs2 inodes. Also I've removed a
one-linrer function thats not required any more.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 29937ac6 06-Jul-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fixes to scanning of glocks (again)

This really is the correct fix this time. We just ignore all
glocks associated with inodes until the inodes are pushed
from the inode cache. At that point the glocks are queued for
reclaim, so we don't need to do it here.

Also fix one or two other minor bugs.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bdd512ae 22-Jun-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove unused flag

The flag GIF_MIN_INIT is no longer used or required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# faf450ef 22-Jun-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove gfs2_repermission

gfs2_repermission is just a wrapper for permission, so remove it and
call permission directly where required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# feaa7bba 14-Jun-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix unlinked file handling

This patch fixes the way we have been dealing with unlinked,
but still open files. It removes all limits (other than memory
for inodes, as per every other filesystem) on numbers of these
which we can support on GFS2. It also means that (like other
fs) its the responsibility of the last process to close the file
to deallocate the storage, rather than the person who did the
unlinking. Note that with GFS2, those two events might take place
on different nodes.

Also there are a number of other changes:

o We use the Linux inode subsystem as it was intended to be
used, wrt allocating GFS2 inodes
o The Linux inode cache is now the point which we use for
local enforcement of only holding one copy of the inode in
core at once (previous to this we used the glock layer).
o We no longer use the unlinked "special" file. We just ignore it
completely. This makes unlinking more efficient.
o We now use the 4th block allocation state. The previously unused
state is used to track unlinked but still open inodes.
o gfs2_inoded is no longer needed
o Several fields are now no longer needed (and removed) from the in
core struct gfs2_inode
o Several fields are no longer needed (and removed) from the in core
superblock

There are a number of future possible optimisations and clean ups
which have been made possible by this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 320dd101 18-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] glock debugging and inode cache changes

This adds some extra debugging to glock.c and changes
inode.c's deallocation code to call the debugging code
at a suitable moment. I'm chasing down a particular bug
to do with deallocation at the moment and the code can
go again once the bug is fixed.

Also this includes the first part of some changes to unify
the Linux struct inode and GFS2's struct gfs2_inode. This
transformation will happen in small parts over the next short
period.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 3a8a9a10 18-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Update copyright date to 2006

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# bd896801 18-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove semaphore.h from C files

We no longer use semaphores, everything has been converted to
mutex or rwsem, so we don't need to include this header any more.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 64c14ea7 16-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix ref count bug that used to bite us on umount

The ref count of certain glock's got elevated too far during unlink
which caused umount to fail. This fixes it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9801f646 12-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Remove incorrect initialisation of gh_owner

The gh_owner field shouldn't be set or reset outside the glock code.
These were left over from when recursive locking was allowed. It
isn't any more, so they are not needed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# fd88de56 05-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Readpages support

This adds readpages support (and also corrects a small bug in
the readpage error path at the same time). Hopefully this will
improve performance by allowing GFS to submit larger lumps of
I/O at a time.

In order to simplify the setting of BH_Boundary, it currently gets
set when we hit the end of a indirect pointer block. There is
always a boundary at this point with the current allocation code.
It doesn't get all the boundaries right though, so there is still
room for improvement in this.

See comments in fs/gfs2/ops_address.c for further information about
readpages with GFS2.

Signed-off-by: Steven Whitehouse


# 36327521 28-Apr-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Reordering in deallocation to avoid recursive locking

Despite my earlier careful search, there was a recursive lock left
in the deallocation code. This removes it. It also should speed up
deallocation be reducing the number of locking operations which take
place by using two "try lock" operations on the two locks involved in
inode deallocation which allows us to grab the locks out of order
(compared with NFS which grabs the inode lock first and the iopen
lock later). It is ok for us to fail while doing this since if it
does fail it means that someone else is still using the inode and
thus it wouldn't be possible to deallocate anyway.

This fixes the bug reported to me by Rob Kenna.

Cc: Rob Kenna <rkenna@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 190562bd 20-Apr-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix a bug: scheduling under a spinlock

At some stage, a mutex was added to gfs2_glock_put() without
checking all its call sites. Two of them were called from
under a spinlock causing random delays at various points and
crashes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 71b86f56 28-Mar-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Further updates to dir and logging code

This reduces the size of the directory code by about 3k and gets
readdir() to use the functions which were introduced in the previous
directory code update.

Two memory allocations are merged into one. Eliminates zeroing of some
buffers which were never used before they were initialised by
other data.

There is still scope for further improvement in the directory code.

On the logging side, a hand created mutex has been replaced by a
standard Linux mutex in the log allocation code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c752666c 19-Mar-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix bug in directory code and tidy up

Due to a typo, the dir leaf split operation was (for the first
split in a directory) writing the new hash vaules at the
wrong offset. This is now fixed.

Also some other tidy ups are included:

- We use GFS2's hash function for dentries (see ops_dentry.c) so that
we don't have to keep recalculating the hash values.
- A lot of common code is eliminated between the various directory
lookup routines.
- Better error checking on directory lookup (previously different
routines checked for different errors)
- The leaf split operation has a couple of redundant operations
removed from it, so it should be faster.

There is still further scope for further clean ups in the directory
code, and readdir in particular could do with slimming down a bit.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# c9fd4307 01-Mar-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Tidy up mount code.

We no longer lookup ".gfs2_admin" in the root directory in order to
find it, but instead use the inode number given in the superblock.
Both the root directory and the admin directory are now looked up using
the same routine, so the redundant code is removed.

Also, there is no longer a reference to the root inode in the
GFS2 super block. When required this can be retreived via
sb->s_root->d_inode instead.

Assuming that we introduce a metadata filesystem type for GFS, then
this is a first step towards that goal.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 5c676f6d 27-Feb-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Macros removal in gfs2.h

As suggested by Pekka Enberg <penberg@cs.helsinki.fi>.

The DIV_RU macro is renamed DIV_ROUND_UP and and moved to kernel.h
The other macros are gone from gfs2.h as (although not requested
by Pekka Enberg) are a number of included header file which are now
included individually. The inode number comparison function is
now an inline function.

The DT2IF and IF2DT may be addressed in a future patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 568f4c96 26-Feb-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] 80 Column audit of GFS2

Requested by:
Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f55ab26a 20-Feb-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Use mutices rather than semaphores

As well as a number of minor bug fixes, this patch changes GFS
to use mutices rather than semaphores. This results in better
information in case there are any locking problems.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 7359a19c 12-Feb-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Fix for root inode ref count bug

Umount is now working correctly again. The bug was due to
not getting an extra ref count when mounting the fs. We
should have bumped it by two (once for the internal pointer
to the root inode from the super block and once for the
inode hanging off the dcache entry for root).

Also this patch tidys up the code dealing with looking up
and creating inodes. We now pass Linux inodes (with gfs2_inodes
attached) rather than the other way around and this reduces code
duplication in various places.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# f42faf4f 30-Jan-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Add gfs2_internal_read()

Add the new external read function. Its temporarily in jdata.c
even though the protoype is in ops_file.h - this will change
shortly. The current implementation will change to a page cache
one when that happens.

In order to effect the above changes, the various internal inodes
now have Linux inodes attached to them. We keep the references to
the Linux inodes, rather than the gfs2_inodes in the super block.

In order to get everything to work correctly I've had to reorder
the init sequence on mount (which I should probably have done
earlier when .gfs2_admin was made visible).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2442a098 30-Jan-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Bug fix relating to endian conversion in inode.c

A two line fix to get endian conversion correct.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# d4e9c4c3 18-Jan-2006 Steven Whitehouse <steve@chygwyn.com>

[GFS2] Add an additional argument to gfs2_trans_add_bh()

This adds an extra argument to gfs2_trans_add_bh() to indicate whether the
bh being added to the transaction is metadata or data. Its currently unused
since all existing callers set it to 1 (metadata) but following patches will
make use of it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b96ca4fa 18-Jan-2006 Steven Whitehouse <steve@chygwyn.com>

[GFS2] Update init_dinode() to reduce stack usage

We no longer allocate a dinode on the stack in init_dinode()
and we no longer use gfs2_dinode_out (eliminating one copy) and
gfs2_meta_header_in (eliminating another copy). The meta_header_in
fucntion is now no longer referenced from outside gfs2_ondisk.c, so
make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# b3b94faa 16-Jan-2006 David Teigland <teigland@redhat.com>

[GFS2] The core of GFS2

This patch contains all the core files for GFS2.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>