History log of /linux-master/fs/ocfs2/super.c
Revision Date Author Comments
# 46bd9449 24-Feb-2024 Chengming Zhou <zhouchengming@bytedance.com>

ocfs2: remove SLAB_MEM_SPREAD flag usage

The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove
its usage so we can delete it from slab. No functional change.

Link: https://lkml.kernel.org/r/20240224135008.829878-1-chengming.zhou@linux.dev
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ccb49011 06-Feb-2024 Jan Kara <jack@suse.cz>

quota: Properly annotate i_dquot arrays with __rcu

Dquots pointed to from i_dquot arrays in inodes are protected by
dquot_srcu. Annotate them as such and change .get_dquots callback to
return properly annotated pointer to make sparse happy.

Fixes: b9ba6f94b238 ("quota: remove dqptr_sem")
Signed-off-by: Jan Kara <jack@suse.cz>


# f88c3fb8 12-Mar-2024 Linus Torvalds <torvalds@linux-foundation.org>

mm, slab: remove last vestiges of SLAB_MEM_SPREAD

Yes, yes, I know the slab people were planning on going slow and letting
every subsystem fight this thing on their own. But let's just rip off
the band-aid and get it over and done with. I don't want to see a
number of unnecessary pull requests just to get rid of a flag that no
longer has any meaning.

This was mainly done with a couple of 'sed' scripts and then some manual
cleanup of the end result.

Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a4af51ce 06-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

fs: super_set_uuid()

Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.

And add a helper super_set_uuid(), for setting nonstandard length uuids.

Helper is now required for the new FS_IOC_GETUUID ioctl; if
super_set_uuid() hasn't been called, the ioctl won't be supported.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>


# a53fb69b 26-Jul-2023 Kees Cook <keescook@chromium.org>

ocfs2: use regular seq_show_option for osb_cluster_stack

While cleaning up seq_show_option_n()'s use of strncpy, it was noticed
that the osb_cluster_stack member is always NUL-terminated, so there is no
need to use the special seq_show_option_n() routine. Replace it with the
standard seq_show_option() routine.

Link: https://lkml.kernel.org/r/20230726215919.never.127-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 50d92788 22-May-2023 Luís Henriques <ocfs2-devel@oss.oracle.com>

ocfs2: fix use-after-free when unmounting read-only filesystem

It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using
fstest generic/452. After a read-only remount, quotas are suspended and
ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info(). When unmounting
the filesystem, an UAF access to the oinfo will eventually cause a crash.

BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0
Read of size 8 at addr ffff8880389a8208 by task umount/669
...
Call Trace:
<TASK>
...
timer_delete+0x54/0xc0
try_to_grab_pending+0x31/0x230
__cancel_work_timer+0x6c/0x270
ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2]
ocfs2_dismount_volume+0xdd/0x450 [ocfs2]
generic_shutdown_super+0xaa/0x280
kill_block_super+0x46/0x70
deactivate_locked_super+0x4d/0xb0
cleanup_mnt+0x135/0x1f0
...
</TASK>

Allocated by task 632:
kasan_save_stack+0x1c/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x8b/0x90
ocfs2_local_read_info+0xe3/0x9a0 [ocfs2]
dquot_load_quota_sb+0x34b/0x680
dquot_load_quota_inode+0xfe/0x1a0
ocfs2_enable_quotas+0x190/0x2f0 [ocfs2]
ocfs2_fill_super+0x14ef/0x2120 [ocfs2]
mount_bdev+0x1be/0x200
legacy_get_tree+0x6c/0xb0
vfs_get_tree+0x3e/0x110
path_mount+0xa90/0xe10
__x64_sys_mount+0x16f/0x1a0
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 650:
kasan_save_stack+0x1c/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0xf9/0x150
__kmem_cache_free+0x89/0x180
ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2]
dquot_disable+0x35f/0xa70
ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2]
ocfs2_remount+0x150/0x580 [ocfs2]
reconfigure_super+0x1a5/0x3a0
path_mount+0xc8a/0xe10
__x64_sys_mount+0x16f/0x1a0
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# ce2fcf15 09-Nov-2022 Li Zetao <ocfs2-devel@oss.oracle.com>

ocfs2: fix memory leak in ocfs2_mount_volume()

There is a memory leak reported by kmemleak:

unreferenced object 0xffff88810cc65e60 (size 32):
comm "mount.ocfs2", pid 23753, jiffies 4302528942 (age 34735.105s)
hex dump (first 32 bytes):
10 00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 ................
01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8170f73d>] __kmalloc+0x4d/0x150
[<ffffffffa0ac3f51>] ocfs2_compute_replay_slots+0x121/0x330 [ocfs2]
[<ffffffffa0b65165>] ocfs2_check_volume+0x485/0x900 [ocfs2]
[<ffffffffa0b68129>] ocfs2_mount_volume.isra.0+0x1e9/0x650 [ocfs2]
[<ffffffffa0b7160b>] ocfs2_fill_super+0xe0b/0x1740 [ocfs2]
[<ffffffff818e1fe2>] mount_bdev+0x312/0x400
[<ffffffff819a086d>] legacy_get_tree+0xed/0x1d0
[<ffffffff818de82d>] vfs_get_tree+0x7d/0x230
[<ffffffff81957f92>] path_mount+0xd62/0x1760
[<ffffffff81958a5a>] do_mount+0xca/0xe0
[<ffffffff81958d3c>] __x64_sys_mount+0x12c/0x1a0
[<ffffffff82f26f15>] do_syscall_64+0x35/0x80
[<ffffffff8300006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This call stack is related to two problems. Firstly, the ocfs2 super uses
"replay_map" to trace online/offline slots, in order to recover offline
slots during recovery and mount. But when ocfs2_truncate_log_init()
returns an error in ocfs2_mount_volume(), the memory of "replay_map" will
not be freed in error handling path. Secondly, the memory of "replay_map"
will not be freed if d_make_root() returns an error in ocfs2_fill_super().
But the memory of "replay_map" will be freed normally when completing
recovery and mount in ocfs2_complete_mount_recovery().

Fix the first problem by adding error handling path to free "replay_map"
when ocfs2_truncate_log_init() fails. And fix the second problem by
calling ocfs2_free_replay_slots(osb) in the error handling path
"out_dismount". In addition, since ocfs2_free_replay_slots() is static,
it is necessary to remove its static attribute and declare it in header
file.

Link: https://lkml.kernel.org/r/20221109074627.2303950-1-lizetao1@huawei.com
Fixes: 9140db04ef18 ("ocfs2: recover orphans in offline slots during recovery and mount")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# c97e21fe 18-Aug-2022 Wolfram Sang <wsa+renesas@sang-engineering.com>

ocfs2: move from strlcpy with unused retval to strscpy

Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220818210123.7637-4-wsa+renesas@sang-engineering.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 54d9171d 01-Sep-2022 Zhang Yi <yi.zhang@huawei.com>

ocfs2: replace ll_rw_block()

ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block() in ocfs2.

Link: https://lkml.kernel.org/r/20220901133505.2510834-9-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 550842cc 15-Aug-2022 Heming Zhao <ocfs2-devel@oss.oracle.com>

ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown

After commit 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().

ie: On local mount mode, no dlm resource is initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call
ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will
trigger kernel crash.

This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().

Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 1420c4a5 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

fs/buffer: Combine two submit_bh() and ll_rw_block() arguments

Both submit_bh() and ll_rw_block() accept a request operation type and
request flags as their first two arguments. Micro-optimize these two
functions by combining these first two arguments into a single argument.
This patch does not change the behavior of any of the modified code.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Song Liu <song@kernel.org> (for the md changes)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c80af0c2 03-Jun-2022 Junxiao Bi <ocfs2-devel@oss.oracle.com>

Revert "ocfs2: mount shared volume without ha stack"

This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.

This commit introduced a regression that can cause mount hung. The
changes in __ocfs2_find_empty_slot causes that any node with none-zero
node number can grab the slot that was already taken by node 0, so node 1
will access the same journal with node 0, when it try to grab journal
cluster lock, it will hung because it was already acquired by node 0.
It's very easy to reproduce this, in one cluster, mount node 0 first, then
node 1, you will see the following call trace from node 1.

[13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
[13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
[13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000
[13148.749354] Call Trace:
[13148.750718] <TASK>
[13148.752019] ? usleep_range+0x90/0x89
[13148.753882] __schedule+0x210/0x567
[13148.755684] schedule+0x44/0xa8
[13148.757270] schedule_timeout+0x106/0x13c
[13148.759273] ? __prepare_to_swait+0x53/0x78
[13148.761218] __wait_for_common+0xae/0x163
[13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
[13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2]
[13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2]
[13148.775401] ? iput+0x69/0xba
[13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
[13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2]
[13148.781756] mount_bdev+0x190/0x1b7
[13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2]
[13148.785634] legacy_get_tree+0x27/0x48
[13148.787466] vfs_get_tree+0x25/0xd0
[13148.789270] do_new_mount+0x18c/0x2d9
[13148.791046] __x64_sys_mount+0x10e/0x142
[13148.792911] do_syscall_64+0x3b/0x89
[13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0
[13148.797051] RIP: 0033:0x7f2309f6e26e
[13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
[13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
[13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
[13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
[13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
[13148.816564] </TASK>

To fix it, we can just fix __ocfs2_find_empty_slot. But original commit
introduced the feature to mount ocfs2 locally even it is cluster based,
that is a very dangerous, it can easily cause serious data corruption,
there is no way to stop other nodes mounting the fs and corrupting it.
Setup ha or other cluster-aware stack is just the cost that we have to
take for avoiding corruption, otherwise we have to do it in kernel.

Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f1e75d12 29-Apr-2022 Heming Zhao via Ocfs2-devel <ocfs2-devel@oss.oracle.com>

ocfs2: rewrite error handling of ocfs2_fill_super

Current ocfs2_fill_super() uses one goto label "read_super_error" to
handle all error cases. And with previous serial patches, the error
handling should fork more branches to handle different error cases. This
patch rewrite the error handling of ocfs2_fill_super.

Link: https://lkml.kernel.org/r/20220424130952.2436-6-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 0737e01d 29-Apr-2022 Heming Zhao via Ocfs2-devel <ocfs2-devel@oss.oracle.com>

ocfs2: ocfs2_mount_volume does cleanup job before return error

After this patch, when error, ocfs2_fill_super doesn't take care to
release resources which are allocated in ocfs2_mount_volume.

Link: https://lkml.kernel.org/r/20220424130952.2436-5-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a8a986db 29-Apr-2022 Heming Zhao via Ocfs2-devel <ocfs2-devel@oss.oracle.com>

ocfs2: ocfs2_initialize_super does cleanup job before return error

After this patch, when error, ocfs2_fill_super doesn't take care to
release resources which are allocated in ocfs2_initialize_super.

Link: https://lkml.kernel.org/r/20220424130952.2436-4-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 54bd3f7c 29-Apr-2022 Heming Zhao via Ocfs2-devel <ocfs2-devel@oss.oracle.com>

ocfs2: change return type of ocfs2_resmap_init

Since ocfs2_resmap_init() always return 0, change it to void.

Link: https://lkml.kernel.org/r/20220424130952.2436-3-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bb20b31d 29-Apr-2022 Heming Zhao via Ocfs2-devel <ocfs2-devel@oss.oracle.com>

ocfs2: fix mounting crash if journal is not alloced

Patch series "rewrite error handling during mounting stage".


This patch (of 5):

After commit da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown"),
journal init later than before, it makes NULL pointer access in free
routine.

Crash flow:

ocfs2_fill_super
+ ocfs2_mount_volume
| + ocfs2_dlm_init //fail & return, osb->journal is NULL.
| + ...
| + ocfs2_check_volume //no chance to init osb->journal
|
+ ...
+ ocfs2_dismount_volume
ocfs2_release_system_inodes
...
evict
...
ocfs2_clear_inode
ocfs2_checkpoint_inode
ocfs2_ci_fully_checkpointed
time_after(journal->j_trans_id, ci->ci_last_trans)
+ journal is empty, crash!

For fixing, there are three solutions:

1> Partly revert commit da5e7c87827e8

For avoiding kernel crash, this make sense for us. We only
concerned whether there has any non-system inode access before dlm
init. The answer is NO. And all journal replay/recovery handling
happen after dlm & journal init done. So this method is not graceful
but workable.

2> Add osb->journal check in free inode routine (eg ocfs2_clear_inode)

The fix code is special for mounting phase, but it will continue
working after mounting stage. In another word, this method adds
useless code in normal inode free flow.

3> Do directly free inode in mounting phase

This method is brutal/complex and may introduce unsafe code,
currently maintainer didn't like.

At last, we chose method <1> and did partly reverted job. We reverted
journal init codes, and kept cleanup codes flow.

Link: https://lkml.kernel.org/r/20220424130952.2436-1-heming.zhao@suse.com
Link: https://lkml.kernel.org/r/20220424130952.2436-2-heming.zhao@suse.com
Fixes: da5e7c87827e8 ("ocfs2: cleanup journal init and shutdown")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# fd60b288 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

fs: allocate inode by using alloc_inode_sb()

The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7b0b1332 16-Mar-2022 Joseph Qi <joseph.qi@linux.alibaba.com>

ocfs2: fix crash when initialize filecheck kobj fails

Once s_root is set, genric_shutdown_super() will be called if
fill_super() fails. That means, we will call ocfs2_dismount_volume()
twice in such case, which can lead to kernel crash.

Fix this issue by initializing filecheck kobj before setting s_root.

Link: https://lkml.kernel.org/r/20220310081930.86305-1-joseph.qi@linux.alibaba.com
Fixes: 5f483c4abb50 ("ocfs2: add kobject for online file check")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0a4ee518 21-Jan-2022 Christoph Hellwig <hch@lst.de>

mm: remove cleancache

Patch series "remove Xen tmem leftovers".

Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.

This patch (of 13):

The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dcd0 ("xen: remove tmem driver").

[akpm@linux-foundation.org: remove now-unreachable code]

Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# da5e7c87 05-Nov-2021 Valentin Vidic <vvidic@valentin-vidic.from.hr>

ocfs2: cleanup journal init and shutdown

Allocate and free struct ocfs2_journal in ocfs2_journal_init and
ocfs2_journal_shutdown. Init and release of system inodes references
the journal so reorder calls to make sure they work correctly.

Link: https://lkml.kernel.org/r/20211009145006.3478-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b15fa922 18-Oct-2021 Valentin Vidic <vvidic@valentin-vidic.from.hr>

ocfs2: mount fails with buffer overflow in strlen

Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an
ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
trace below. Problem seems to be that strings for cluster stack and
cluster name are not guaranteed to be null terminated in the disk
representation, while strlcpy assumes that the source string is always
null terminated. This causes a read outside of the source string
triggering the buffer overflow detection.

detected buffer overflow in strlen
------------[ cut here ]------------
kernel BUG at lib/string.c:1149!
invalid opcode: 0000 [#1] SMP PTI
CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
Debian 5.14.6-2
RIP: 0010:fortify_panic+0xf/0x11
...
Call Trace:
ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
ocfs2_fill_super+0x359/0x19b0 [ocfs2]
mount_bdev+0x185/0x1b0
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
path_mount+0x454/0xa20
__x64_sys_mount+0x103/0x140
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fa60ce2c 06-May-2021 Masahiro Yamada <masahiroy@kernel.org>

treewide: remove editor modelines and cruft

The section "19) Editor modelines and other cruft" in
Documentation/process/coding-style.rst clearly says, "Do not include any
of these in source files."

I recently receive a patch to explicitly add a new one.

Let's do treewide cleanups, otherwise some people follow the existing code
and attempt to upstream their favoriate editor setups.

It is even nicer if scripts/checkpatch.pl can check it.

If we like to impose coding style in an editor-independent manner, I think
editorconfig (patch [1]) is a saner solution.

[1] https://lore.kernel.org/lkml/20200703073143.423557-1-danny@kdrag0n.dev/

Link: https://lkml.kernel.org/r/20210324054457.1477489-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org> [auxdisplay]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6efb5949 24-Feb-2021 Yi Li <yili@winhong.com>

ocfs2: remove redundant conditional before iput

iput handles NULL pointers gracefully, so there's no need to check the
pointer before the call.

Link: https://lkml.kernel.org/r/20201231040535.4091761-1-yili@winhong.com
Signed-off-by: Yi Li <yili@winhong.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f5785283 13-Nov-2020 Wengang Wang <wen.gang.wang@oracle.com>

ocfs2: initialize ip_next_orphan

Though problem if found on a lower 4.1.12 kernel, I think upstream has
same issue.

In one node in the cluster, there is the following callback trace:

# cat /proc/21473/stack
__ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
ocfs2_evict_inode+0x152/0x820 [ocfs2]
evict+0xae/0x1a0
iput+0x1c6/0x230
ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
ocfs2_dir_foreach+0x29/0x30 [ocfs2]
ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
process_one_work+0x169/0x4a0
worker_thread+0x5b/0x560
kthread+0xcb/0xf0
ret_from_fork+0x61/0x90

The above stack is not reasonable, the final iput shouldn't happen in
ocfs2_orphan_filldir() function. Looking at the code,

2067 /* Skip inodes which are already added to recover list, since dio may
2068 * happen concurrently with unlink/rename */
2069 if (OCFS2_I(iter)->ip_next_orphan) {
2070 iput(iter);
2071 return 0;
2072 }
2073

The logic thinks the inode is already in recover list on seeing
ip_next_orphan is non-NULL, so it skip this inode after dropping a
reference which incremented in ocfs2_iget().

While, if the inode is already in recover list, it should have another
reference and the iput() at line 2070 should not be the final iput
(dropping the last reference). So I don't think the inode is really in
the recover list (no vmcore to confirm).

Note that ocfs2_queue_orphans(), though not shown up in the call back
trace, is holding cluster lock on the orphan directory when looking up
for unlinked inodes. The on disk inode eviction could involve a lot of
IOs which may need long time to finish. That means this node could hold
the cluster lock for very long time, that can lead to the lock requests
(from other nodes) to the orhpan directory hang for long time.

Looking at more on ip_next_orphan, I found it's not initialized when
allocating a new ocfs2_inode_info structure.

This causes te reflink operations from some nodes hang for very long
time waiting for the cluster lock on the orphan directory.

Fix: initialize ip_next_orphan as NULL.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 38d51b2d 07-Aug-2020 Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: change slot number type s16 to u16

Dan Carpenter reported the following static checker warning.

fs/ocfs2/super.c:1269 ocfs2_parse_options() warn: '(-1)' 65535 can't fit into 32767 'mopt->slot'
fs/ocfs2/suballoc.c:859 ocfs2_init_inode_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_inode_steal_slot'
fs/ocfs2/suballoc.c:867 ocfs2_init_meta_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_meta_steal_slot'

That's because OCFS2_INVALID_SLOT is (u16)-1. Slot number in ocfs2 can be
never negative, so change s16 to u16.

Fixes: 9277f8334ffc ("ocfs2: fix value of OCFS2_INVALID_SLOT")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200627001259.19757-1-junxiao.bi@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 912f655d 01-Jun-2020 Gang He <ghe@suse.com>

ocfs2: mount shared volume without ha stack

Usually we create and use a ocfs2 shared volume on the top of ha stack.
For pcmk based ha stack, which includes DLM, corosync and pacemaker
services.

The customers complained they could not mount existent ocfs2 volume in
the single node without ha stack, e.g. single node backup/restore
scenario.

Like this case, the customers just want to access the data from the
existent ocfs2 volume quickly, but do not want to restart or setup ha
stack.

Then, I'd like to add a mount option "nocluster", if the users use this
option to mount a ocfs2 shared volume, the whole mount will not depend
on the ha related services. the command will mount the existent ocfs2
volume directly (like local mount), for avoiding setup the ha stack.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200423053300.22661-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d293d3af 01-Apr-2020 Takashi Iwai <tiwai@suse.de>

ocfs2: use scnprintf() for avoiding potential buffer overflow

Since snprintf() returns the would-be-output size instead of the actual
output size, the succeeding calls may go beyond the given buffer limit.
Fix it by replacing with scnprintf().

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200311093516.25300-1-tiwai@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7212b95e 01-Nov-2019 Jan Kara <jack@suse.cz>

fs: Use dquot_load_quota_inode() from filesystems

Use dquot_load_quota_inode from filesystems instead of dquot_enable().
In all three cases we want to load quota inode and never use the
function to update quota flags.

Signed-off-by: Jan Kara <jack@suse.cz>


# 5e7a3ed9 23-Sep-2019 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ocfs2: further debugfs cleanups

There is no need to check return value of debugfs_create functions, but
the last sweep through ocfs missed a number of places where this was
happening. There is also no need to save the individual dentries for the
debugfs files, as everything is can just be removed at once when the
directory is removed.

By getting rid of the file dentries for the debugfs entries, a bit of
local memory can be saved as well.

[colin.king@canonical.com: ensure ret is set to zero before returning]
Link: http://lkml.kernel.org/r/20190807121929.28918-1-colin.king@canonical.com
Link: http://lkml.kernel.org/r/20190731132119.GA12603@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jia Guo <guojia12@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e581595e 11-Jul-2019 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ocfs: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Also, because there is no need to save the file dentry, remove all of
the variables that were being saved, and just recursively delete the
whole directory when shutting down, saving a lot of logic and local
variables.

[gregkh@linuxfoundation.org: v2]
Link: http://lkml.kernel.org/r/20190613055455.GE19717@kroah.com
Link: http://lkml.kernel.org/r/20190612152912.GA19151@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jia Guo <guojia12@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 328970de 23-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 145

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 021110 1307 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 84 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190524100844.756442981@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# e91b9194 15-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

ocfs2: switch to ->free_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 72deb455 05-Apr-2019 Christoph Hellwig <hch@lst.de>

block: remove CONFIG_LBDAF

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5500ab4e 05-Mar-2019 Gang He <ghe@suse.com>

ocfs2: fix the application IO timeout when fstrim is running

The user reported this problem, the upper application IO was timeout
when fstrim was running on this ocfs2 partition. the application
monitoring resource agent considered that this application did not work,
then this node was fenced by the cluster brain (e.g. pacemaker).

The root cause is that fstrim thread always holds main_bm meta-file
related locks until all the cluster groups are trimmed. This patch will
make fstrim thread release main_bm meta-file related locks when each
cluster group is trimmed, this will let the current application IO has a
chance to claim the clusters from main_bm meta-file.

Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5f483c4a 05-Apr-2018 Gang He <ghe@suse.com>

ocfs2: add kobject for online file check

Use embedded kobject mechanism for online file check feature, this will
avoid to use a global list to save/search per-device online file check
related data, meanwhile, reduce the code lines and make the code logic
clear. The changed code is based on Goldwyn Rodrigues's patches and
ext4 fs code.

Link: http://lkml.kernel.org/r/1495611866-27360-4-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a17b485a 05-Apr-2018 piaojun <piaojun@huawei.com>

ocfs2: remove unnecessary null pointer check before kmem_cache_destroy()

As kmem_cache_destroy() already handles null pointers, so we can remove
the conditional test entirely.

Link: http://lkml.kernel.org/r/5A9EB21D.3000209@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1119d3c0 05-Apr-2018 piaojun <piaojun@huawei.com>

ocfs2: use 'osb' instead of 'OCFS2_SB()'

We could use 'osb' instead of 'OCFS2_SB()' to make code more elegant.

Link: http://lkml.kernel.org/r/5A702111.7090907@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 025bcbde 31-Jan-2018 piaojun <piaojun@huawei.com>

ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid

If metadata is corrupted such as 'invalid inode block', we will get
failed by calling 'mount()' and then set filesystem readonly as below:

ocfs2_mount
ocfs2_initialize_super
ocfs2_init_global_system_inodes
ocfs2_iget
ocfs2_read_locked_inode
ocfs2_validate_inode_block
ocfs2_error
ocfs2_handle_error
ocfs2_set_ro_flag(osb, 0); // set readonly

In this situation we need return -EROFS to 'mount.ocfs2', so that user
can fix it by fsck. And then mount again. In addition, 'mount.ocfs2'
should be updated correspondingly as it only return 1 for all errno.
And I will post a patch for 'mount.ocfs2' too.

Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a52370b3 31-Jan-2018 Gang He <ghe@suse.com>

ocfs2: give an obvious tip for mismatched cluster names

Add an obvious error message, due to mismatched cluster names between
on-disk and in the current cluster. We can meet this case during OCFS2
cluster migration.

If we can give the user an obvious tip for why they can not mount the
file system after migration, they can quickly fix this mismatch problem.

Second, also move printing ocfs2_fill_super() errno to the front of
ocfs2_dismount_volume(), since ocfs2_dismount_volume() will also print
its own message.

I looked through all the code of OCFS2 (include o2cb); there is not any
place which returns this error. In fact, the function calling path
ocfs2_fill_super -> ocfs2_mount_volume -> ocfs2_dlm_init ->
dlm_new_lockspace is a very specific one. We can use this errno to give
the user a more clear tip, since this case is a little common during
cluster migration, but the customer can quickly get the failure cause if
there is a error printed. Also, I think it is not possible to add this
errno in the o2cb path during ocfs2_dlm_init(), since the o2cb code has
been stable for a long time.

We only print this error tip when the user uses pcmk stack, since using
the o2cb stack the user will not meet this error.

[ghe@suse.com: v2]
Link: http://lkml.kernel.org/r/1495419305-3780-1-git-send-email-ghe@suse.com
Link: http://lkml.kernel.org/r/1495089336-19312-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"

SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 23e0813a 15-Nov-2017 piaojun <piaojun@huawei.com>

ocfs2: no need flush workqueue before destroying it

destroy_workqueue() will do flushing work for us.

Link: http://lkml.kernel.org/r/59E06476.3090502@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 964f14a0 06-Sep-2017 Jun Piao <piaojun@huawei.com>

ocfs2: clean up some dead code

clean up some unused functions and parameters.

Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bc98a42c 17-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)

Firstly by applying the following with coccinelle's spatch:

@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)

@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)

to remove left over excess bracketage and finally by applying:

@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>


# 85787090 10-May-2017 Christoph Hellwig <hch@lst.de>

fs: switch ->s_uuid to uuid_t

For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>


# f361bf4a 03-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency

Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.

This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.

Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 395627b0 12-Dec-2016 Deepa Dinamani <deepa.kernel@gmail.com>

ocfs2: use time64_t to represent orphan scan times

struct timespec is not y2038 safe. Use time64_t which is y2038 safe to
represent orphan scan times. time64_t is sufficient here as only the
seconds delta times are relevant.

Also use appropriate time functions that return time in time64_t format.
Time functions now return monotonic time instead of real time as only
delta scan times are relevant and these values are not persistent across
reboots.

The format string for the debug print is still using long as this is
only the time elapsed since the last scan and long is sufficient to
represent this value.

Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2a64f80e 23-Nov-2016 Jan Kara <jack@suse.cz>

ocfs2: Protect periodic quota syncing with s_umount semaphore

New quota locking rules will require s_umount semaphore for all quota
scanning functions. Add is for periodic quota syncing.

Tested-by: Eric Ren <zren@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# 44be9756 07-Oct-2016 Bhaktipriya Shridhar <bhaktipriya96@gmail.com>

fs/ocfs2/super: remove deprecated create_singlethread_workqueue()

The workqueue "ocfs2_wq" queues multiple work items viz
&osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
&osb->osb_truncate_log_wq which require strict execution ordering. Hence,
an ordered dedicated workqueue has been used.

WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure because the workqueue is being used on a memory reclaim path.

Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 698d44b4 26-Jul-2016 Joseph Qi <joseph.qi@huawei.com>

ocfs2: cleanup implemented prototypes

Several prototypes in inode.h are just defined but not actually
implemented and used, so remove them.

Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dfec8a14 05-Jun-2016 Mike Christie <mchristi@redhat.com>

fs: have ll_rw_block users pass in op and flags separately

This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 35ddf78e 25-Mar-2016 jiangyiwen <jiangyiwen@huawei.com>

ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local

This patch fixes a deadlock, as follows:

Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted

2) start to mount b start to mount a

3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++

4) -------------------- all nodes' network down --------------------

5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.

6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.

Because all works in the ocfs2_wq are relative to the super block.

The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e63890f3 25-Mar-2016 Ryan Ding <ryan.ding@oracle.com>

ocfs2: fix ip_unaligned_aio deadlock with dio work queue

In the current implementation of unaligned aio+dio, lock order behave as
follow:

in user process context:
-> call io_submit()
-> get i_mutex
<== window1
-> get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() return

in dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
<== window2
-> get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutex

There is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there
is a user process in the 'window1' stage, the system will became
deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio
lock, while there is a direct bio hold ip_unaligned_aio mutex who is
waiting for a dio work queue thread to be schedule. But all the dio
work queue thread is waiting for i_mutex lock in 'window2'.

This case only happened in a test which send a large number(more than
256) of aio at one io_submit() call.

My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.

[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4506cfb6 25-Mar-2016 Ryan Ding <ryan.ding@oracle.com>

ocfs2: record UNWRITTEN extents when populate write desc

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

There is still one issue in the direct write procedure.

phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk

When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.

To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.

This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.

Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a849d468 22-Mar-2016 Gang He <ghe@suse.com>

ocfs2: create/remove sysfile for online file check

Create online file check sysfile when ocfs2 mount, remove the related
sysfile when ocfs2 umount.

Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bfd97a03 15-Mar-2016 jiangyiwen <jiangyiwen@huawei.com>

ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()

Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.

ocfs2_wake_downconvert_thread
ocfs2_rw_unlock
ocfs2_dio_end_io
dio_complete
.....
bio_endio
req_bio_endio
....
scsi_io_completion
blk_done_softirq
__do_softirq
do_softirq
irq_exit
do_IRQ
ocfs2_osb_dump
cat /sys/kernel/debug/ocfs2/${uuid}/fs_state

This patch still uses spin_lock_irqsave() - replace spin_lock() to solve
this situation.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5d097056 14-Jan-2016 Vladimir Davydov <vdavydov.dev@gmail.com>

kmemcg: account certain kmem allocations to memcg

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 72865d92 14-Jan-2016 Joseph Qi <joseph.qi@huawei.com>

ocfs2: clean up redundant NULL check before iput

Since iput will take care the NULL check itself, NULL check before
calling it is redundant. So clean them up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3eb5bdf0 14-Jan-2016 Norton.Zhu <norton.zhu@huawei.com>

ocfs2: optimize bad declarations and redundant assignment

In ocfs2_parse_options,

a) it's better to declare variables(small size) outside of while loop;

b) 'option' will be set by match_int, 'option = 0;' makes no sense, if
match_int failed, it just goto bail and return.

Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a068acf2 04-Sep-2015 Kees Cook <keescook@chromium.org>

fs: create and use seq_show_option for escaping

Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:

$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ecef14a 04-Sep-2015 Joe Perches <joe@perches.com>

ocfs2: neaten do_error, ocfs2_error and ocfs2_abort

These uses sometimes do and sometimes don't have '\n' terminations. Make
the uses consistently use '\n' terminations and remove the newline from
the functions.

Miscellanea:

o Coalesce formats
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7d0fb914 04-Sep-2015 Goldwyn Rodrigues <rgoldwyn@suse.de>

ocfs2: add errors=continue

OCFS2 is often used in high-availaibility systems. However, ocfs2
converts the filesystem to read-only at the drop of the hat. This may
not be necessary, since turning the filesystem read-only would affect
other running processes as well, decreasing availability.

This attempt is to add errors=continue, which would return the EIO to
the calling process and terminate furhter processing so that the
filesystem is not corrupted further. However, the filesystem is not
converted to read-only.

As a future plan, I intend to create a small utility or extend
fsck.ocfs2 to fix small errors such as in the inode. The input to the
utility such as the inode can come from the kernel logs so we don't have
to schedule a downtime for fixing small-enough errors.

The patch changes the ocfs2_error to return an error. The error
returned depends on the mount option set. If none is set, the default
is to turn the filesystem read-only.

Perhaps errors=continue is not the best option name. Historically it is
used for making an attempt to progress in the current process itself.
Should we call it errors=eio? or errors=killproc? Suggestions/Comments
welcome.

Sources are available at:
https://github.com/goldwynr/linux/tree/error-cont

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 512f62ac 04-Sep-2015 Joseph Qi <joseph.qi@huawei.com>

ocfs2: fix race between dio and recover orphan

During direct io the inode will be added to orphan first and then
deleted from orphan. There is a race window that the orphan entry will
be deleted twice and thus trigger the BUG when validating
OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

ocfs2_direct_IO_write
...
ocfs2_add_inode_to_orphan
>>>>>>>> race window.
1) another node may rm the file and then down, this node
take care of orphan recovery and clear flag
OCFS2_DIO_ORPHANED_FL.
2) since rw lock is unlocked, it may race with another
orphan recovery and append dio.
ocfs2_del_inode_from_orphan

So take inode mutex lock when recovering orphans and make rw unlock at the
end of aio write in case of append dio.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8f443e23 21-Apr-2015 Linus Torvalds <torvalds@linux-foundation.org>

Revert "ocfs2: incorrect check for debugfs returns"

This reverts commit e2ac55b6a8e337fac7cc59c6f452caac92ab5ee6.

Huang Ying reports that this causes a hang at boot with debugfs disabled.

It is true that the debugfs error checks are kind of confusing, and this
code certainly merits more cleanup and thinking about it, but there's
something wrong with the trivial "check not just for NULL, but for error
pointers too" patch.

Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir
pointer variable to an error pointer (-ENODEV), and then continue as if
everything was fine. But since debugfs is disabled, all the _users_ of
that pointer end up being compiled away, so even though the pointer can
not be dereferenced, that's still fine.

So it's confusing and somewhat questionable, but the "more correct"
error checks end up causing more trouble than they fix.

Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chengyu Song <csong84@gatech.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9de16262 14-Apr-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

cleancache: zap uuid arg of cleancache_init_shared_fs

Use super_block->s_uuid instead. Every shared filesystem using cleancache
must now initialize super_block->s_uuid before calling
cleancache_init_shared_fs. The only one on the tree, ocfs2, already meets
this requirement.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 58be19dc 14-Apr-2015 Vladimir Davydov <vdavydov.dev@gmail.com>

ocfs2: copy fs uuid to superblock

Currently, maximal number of cleancache enabled filesystems equals 32,
which is insufficient nowadays, because a Linux host can have hundreds
of containers on board, each of which might want its own filesystem.
This patch set targets at removing this limitation - see patch 4 for
more details. Patches 1-3 prepare the code for this change.

This patch (of 4):

This will allow us to remove the uuid argument from
cleancache_init_shared_fs.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1543306e 14-Apr-2015 Joe Perches <joe@perches.com>

ocfs2: logging: remove static buffer, use vsprintf extension %pV

Use the vsprintf %pV extension to avoid using a static buffer and remove
the now unnecessary buffer.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e2ac55b6 14-Apr-2015 Chengyu Song <csong84@gatech.edu>

ocfs2: incorrect check for debugfs returns

debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
is not configured, so the return value should be checked against
ERROR_VALUE as well, otherwise the later dereference of the dentry pointer
would crash the kernel.

This patch tries to solve this problem by fixing certain checks. However,
I have that found other call sites are protected by #ifdef CONFIG_DEBUG_FS.
In current implementation, if CONFIG_DEBUG_FS is defined, then the above
two functions will never return any ERROR_VALUE. So another possibility
to fix this is to surround all the buggy checks/functions with the same
#ifdef CONFIG_DEBUG_FS. But I'm not sure if this would break any functionality,
as only OCFS2_FS_STATS declares dependency on DEBUG_FS.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4813962b 16-Feb-2015 Joseph Qi <joseph.qi@huawei.com>

ocfs2: wait for orphan recovery first once append O_DIRECT write crash

If one node has crashed with orphan entry leftover, another node which do
append O_DIRECT write to the same file will override the
i_dio_orphaned_slot. Then the old entry won't be cleaned forever. If
this case happens, we let it wait for orphan recovery first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1dfeb768 10-Feb-2015 alex chen <alex.chen@huawei.com>

ocfs2: add a mount option journal_async_commit on ocfs2 filesystem

Add a mount option to support JBD2 feature:

JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is opened, journal
commit block can be written to disk without waiting for descriptor blocks,
which can improve journal commit performance. This option will enable
'journal_checksum' internally.

Using the fs_mark benchmark, using journal_async_commit shows a 50%
improvement, the files per second go up from 215.2 to 317.5.

test script:
fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000

default:
FSUse% Count Size Files/sec App Overhead
0 1000 10240 215.2 17878

with journal_async_commit option:
FSUse% Count Size Files/sec App Overhead
0 1000 10240 317.5 17881

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Weiwei Wang <wangww631@huawei.comm>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 664dbd5f 08-Oct-2014 Jan Kara <jack@suse.cz>

ocfs2: Use generic helpers for quotaon and quotaoff

Ocfs2 can just use the generic helpers provided by quota code for
turning quotas on and off when quota files are stored as system inodes.
The only difference is the feature test in ocfs2_quota_on() and that is
covered by dquot_quota_enable() checking whether usage tracking is
enabled (which can happen only if the filesystem has the quota feature
set).

Signed-off-by: Jan Kara <jack@suse.cz>


# dc171580 10-Dec-2014 Jan Kara <jack@suse.cz>

ocfs2: fix error handling when creating debugfs root in ocfs2_init()

Error handling if creation of root of debugfs in ocfs2_init() fails is
broken. Although error code is set we fail to exit ocfs2_init() with
error and thus initialization ends with success. Later when mounting a
filesystem, ocfs2 debugfs entries end up being created in the root of
debugfs filesystem which is confusing.

Fix the error handling to bail out.

Coverity id: 1227009.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1c92ec67 29-Sep-2014 Jan Kara <jack@suse.cz>

ocfs2: Convert to private i_dquot field

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# f13a568e 25-Sep-2014 Joseph Qi <joseph.qi@huawei.com>

ocfs2: free vol_label in ocfs2_delete_osb()

osb->vol_label is malloced in ocfs2_initialize_super but not freed if
error occurs or during umount, thus causing a memory leak.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52362810 10-Sep-2014 Jan Kara <jack@suse.cz>

ocfs2: Don't use MAXQUOTAS value

MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ocfs2 supports and with
addition of project quotas these two numbers stop matching. So make
ocfs2 use its private definition.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>


# b253bfd8 23-Jun-2014 Xue jiufei <xuejiufei@huawei.com>

ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously"

75f82eaa502c ("ocfs2: fix NULL pointer dereference when dismount and
ocfs2rec simultaneously") may cause umount hang while shutting down
truncate log.

The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
-> free osb->recovery_map
-> ocfs2_truncate_shutdown
-> lock global bitmap inode
-> ocfs2_wait_for_recovery
-> check whether osb->recovery_map->rm_used is zero

Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 69201bb1 04-Jun-2014 Fabian Frederick <fabf@skynet.be>

fs/ocfs2/super.c: use OCFS2_MAX_VOL_LABEL_LEN and strlcpy

Replace strncpy(size 63) by defined value.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a5c4e2a 04-Jun-2014 Fabian Frederick <fabf@skynet.be>

ocfs2: remove NULL assignments on static

Static values are automatically initialized to NULL.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 43b10a20 03-Apr-2014 jiangyiwen <jiangyiwen@huawei.com>

ocfs2: avoid system inode ref confusion by adding mutex lock

The following case may lead to the same system inode ref in confusion.

A thread B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
because of *arr == NULL,
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
at the moment, B thread also gets
two refs, so lead to one more
inode ref.

So add mutex lock to avoid multi thread set two inode ref once at the
same time.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8ed6b237 03-Apr-2014 Goldwyn Rodrigues <rgoldwyn@suse.de>

ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock

The following patches are reverted in this patch because these patches
caused performance regression in the remote unlink() calls.

ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
f7b1aa69be13 - ocfs2: Fix deadlock on umount
5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

Previous patches in this series removed the possible deadlocks from
downconvert thread so the above patches shouldn't be needed anymore.

The regression is caused because these patches delay the iput() in case
of dentry unlocks. This also delays the unlocking of the open lockres.
The open lockresource is required to test if the inode can be wiped from
disk or not. When the deleting node does not get the open lock, it
marks it as orphan (even though it is not in use by another
node/process) and causes a journal checkpoint. This delays operations
following the inode eviction. This also moves the inode to the orphaned
inode which further causes more I/O and a lot of unneccessary orphans.

The following script can be used to generate the load causing issues:

declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do mkdir -p test/dir\${n}
eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"

function fcreate ()
{
exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}

function fremove ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}

function fcp ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}

echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++)) do
create[$x]="`fcreate ${iterations[$x]}`"
copy[$x]="`fcp ${iterations[$x]}`"
remove[$x]="`fremove`"
printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e3a767b6 03-Apr-2014 Jan Kara <jack@suse.cz>

ocfs2: implement delayed dropping of last dquot reference

We cannot drop last dquot reference from downconvert thread as that
creates the following deadlock:

NODE 1 NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
dquot_initialize(bar)
ocfs2_dquot_acquire()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
...
downconvert thread (triggered from another
node or a different process from NODE2)
ocfs2_dentry_post_unlock()
...
iput(foo)
ocfs2_evict_inode(foo)
ocfs2_clear_inode(foo)
dquot_drop(inode)
...
ocfs2_dquot_release()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
- blocks
finds we need more space in
quota file
...
ocfs2_extend_no_holes()
ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
- deadlocks waiting for
downconvert thread

We solve the problem by postponing dropping of the last dquot reference to
a workqueue if it happens from the downconvert thread.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2931cdcb 03-Apr-2014 Darrick J. Wong <darrick.wong@oracle.com>

ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file

Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete. This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.

Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit. Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.

This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file(). aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a75fe48c 03-Apr-2014 joyce.xue <xuejiufei@huawei.com>

ocfs2: remove unused variable uuid_net_key in ocfs2_initialize_super

Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it
up.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c18ceab0 03-Apr-2014 Wengang Wang <wen.gang.wang@oracle.com>

ocfs2: change ip_unaligned_aio to of type mutex from atomit_t

There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.

The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid. Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 02b9984d 13-Mar-2014 Theodore Ts'o <tytso@mit.edu>

fs: push sync_filesystem() down to the file system's remount_fs()

Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs(). This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior. In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org


# 75f82eaa 21-Jan-2014 Yiwen Jiang <jiangyiwen@huawei.com>

ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously

2 nodes cluster, say Node A and Node B, mount the same ocfs2 volume, and
create a file 1.

Node A Node B
open 1, get open lock
rm 1, and then add 1 to orphan_dir
storage link down,
o2hb_write_timeout
->o2quo_disk_timeout
->emergency_restart
at the moment, Node B dismount and do
ocfs2rec simultaneously
1) ocfs2_dismount_volume
->ocfs2_recovery_exit
->wait_event(osb->recovery_event)
->flush_workqueue(ocfs2_wq)
2) ocfs2rec
->queue_work(&journal->j_recovery_work)
->ocfs2_recover_orphans
->ocfs2_commit_truncate
->queue_delayed_work(&osb->osb_truncate_log_wq)

In ocfs2_recovery_exit, it flushes workqueue and then releases system
inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log
which will try to get sys_root_inode, and NULL pointer dereference
occurs.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c74a3bdd 21-Jan-2014 Goldwyn Rodrigues <rgoldwyn@suse.de>

ocfs2: add clustername to cluster connection

This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.

fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.

The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer

For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.

This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.

This patch (of 6):

Add clustername to cluster connection.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ff8fb335 21-Jan-2014 Goldwyn Rodrigues <rgoldwyn@suse.de>

ocfs2: remove versioning information

The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7391a294 12-Nov-2013 Rui Xiang <rui.xiang@huawei.com>

ocfs2: return ENOMEM when sb_getblk() fails

The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So return ENOMEM instead when it fails.

[joseph.qi@huawei.com: ocfs2_symlink_get_block() and ocfs2_read_blocks_sync() and ocfs2_read_blocks() need the same change]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 99d7a882 24-Sep-2013 Goldwyn Rodrigues <rgoldwyn@suse.de>

fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume

While printing 32-bit node numbers, an 8-byte string is not enough.
Increase the size of the string to 12 chars.

This got left out in commit 49fa8140e487 ("fs/ocfs2/super.c: Use bigger
nodestr to accomodate 32-bit node numbers").

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49fa8140e 28-Aug-2013 Goldwyn Rodrigues <rgoldwyn@suse.de>

fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers

While using pacemaker/corosync, the node numbers are generated using IP
address as opposed to serial node number generation. This may not fit
in a 8-byte string. Use a bigger string to print the complete node
number.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8fa9d17f 03-Jul-2013 Goldwyn Rodrigues <rgoldwyn@gmail.com>

ocfs2: remove unecessary variable needs_checkpoint

Code cleanup: needs_checkpoint is assigned to but never used. Delete
the variable.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91417705 07-Mar-2013 Eric W. Biederman <ebiederm@xmission.com>

fs: Limit sys_mount to only request filesystem modules (Part 2).

Add missing MODULE_ALIAS_FS("ocfs2") how did I miss that?
Remove unnecessary MODULE_ALIAS_FS("devpts") devpts can not be modular.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# d787ab09 21-Feb-2013 Tim Gardner <tim.gardner@canonical.com>

ocfs2: remove kfree() redundant null checks

smatch analysis indicates a number of redundant NULL checks before
calling kfree(), eg:

fs/ocfs2/alloc.c:6138 ocfs2_begin_truncate_log_recovery() info:
redundant null check on *tl_copy calling kfree()

fs/ocfs2/alloc.c:6755 ocfs2_zero_range_for_truncate() info:
redundant null check on pages calling kfree()

etc....

[akpm@linux-foundation.org: revert dubious change in ocfs2_begin_truncate_log_recovery()]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c0a8537 25-Sep-2012 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

fs: push rcu_barrier() from deactivate_locked_super() to filesystems

There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 342827d7 17-Mar-2012 Al Viro <viro@zeniv.linux.org.uk>

ocfs2: fix leaks on failure exits in module_init

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 48fde701 08-Jan-2012 Al Viro <viro@zeniv.linux.org.uk>

switch open-coded instances of d_make_root() to new helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be0d93f0 12-Feb-2012 Al Viro <viro@zeniv.linux.org.uk>

... and the same failure exits cleanup for ocfs2

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 34c80b1d 08-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

vfs: switch ->show_options() to struct dentry *

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6b520e05 12-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e41d33af 17-Oct-2011 Dan Magenheimer <dan.magenheimer@oracle.com>

ocfs2: send correct UUID to cleancache initialization

ocfs2: Fix cleancache initialization call to correctly pass uuid

As reported by Steven Whitehouse in https://lkml.org/lkml/2011/5/27/221
the ocfs2 volume UUID is incorrectly passed to cleancache.
As a result, shared-ephemeral tmem pools will not actually
be created; instead they will be private (unshared) which
misses out on a major benefit of tmem.

Reported-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>


# a11f7e63 22-Jun-2011 Mark Fasheh <mfasheh@suse.com>

ocfs2: serialize unaligned aio

Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
similar issues.

In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.

If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.

I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>


# 619c200d 24-Jul-2011 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Clean up messages in the fs

Convert useful messages from ML_NOTICE to KERN_NOTICE to improve readability.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>


# 9e1f1de0 03-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

more conservative S_NOSEC handling

Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.

AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.

It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 03efed8a 27-May-2011 Tiger Yang <tiger.yang@oracle.com>

ocfs2: Bugfix for hard readonly mount

ocfs2 cannot currently mount a device that is readonly at the media
("hard readonly"). Fix the broken places.
see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322

[ Description edited -- Joel ]

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Reviewed-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>


# 1cfd8bd0 26-May-2011 Dan Magenheimer <dan.magenheimer@oracle.com>

ocfs2: add cleancache support

This eighth patch of eight in this cleancache series "opts-in"
cleancache for ocfs2. Clustered filesystems must explicitly enable
cleancache by calling cleancache_init_shared_fs anytime an instance
of the filesystem is mounted. Ocfs2 is currently the only user of
the clustered filesystem interface but nevertheless, the cleancache
hooks in the VFS layer are sufficient for ocfs2 including the matching
cleancache_flush_fs hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v8: trivial merge conflict update]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Nitin Gupta <ngupta@vflare.org>


# e2b0c215 02-Mar-2011 Tiger Yang <tiger.yang@oracle.com>

ocfs2: clean up mount option about atime in ocfs2.txt

As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 32a42d39 23-Feb-2011 Tao Ma <boyu.mt@taobao.com>

ocfs2: Remove masklog ML_SUPER.

Remove mlog(0) from fs/ocfs2/super.c and the masklog SUPER.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>


# 80a9a84d 20-Feb-2011 Wengang Wang <wen.gang.wang@oracle.com>

ocfs2: Add ocfs2_trace.h.

About one year ago, Wengang Wang tried some first steps
to add tracepoints to ocfs2. Hiss original patch is here:
http://oss.oracle.com/pipermail/ocfs2-devel/2009-November/005512.html

But as Steven Rostedt indicated in his article
http://lwn.net/Articles/383362/, we'd better have our trace
files resides in fs/ocfs2, so I rewrited the patch using the
method Steven mentioned in that article.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>


# c1e8d35e 07-Mar-2011 Tao Ma <boyu.mt@taobao.com>

ocfs2: Remove EXIT from masklog.

mlog_exit is used to record the exit status of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.

This patch just try to remove it or change it. So:
1. if all the error paths already use mlog_errno, it is just removed.
Otherwise, it will be replaced by mlog_errno.
2. if it is used to print some return value, it is replaced with
mlog(0,...).
mlog_exit_ptr is changed to mlog(0.
All those mlog(0,...) will be replaced with trace events later.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>


# ef6b689b 20-Feb-2011 Tao Ma <boyu.mt@taobao.com>

ocfs2: Remove ENTRY from masklog.

ENTRY is used to record the entry of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.

So for mlog_entry_void, we just remove it.
for mlog_entry(...), we replace it with mlog(0,...), and they
will be replace by trace event later.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>


# 52c303c5 31-Jan-2011 Mark Fasheh <mfasheh@suse.com>

ocfs2: Check heartbeat mode for kernel stacks only

Commit 2c442719e90a44a6982c033d69df4aae4b167cfa added some checks for proper
heartbeat mode when the o2cb stack is running. Unfortunately, it didn't
take into account that a userpsace stack could be running. Fix this by only
doing the check if o2cb is in use. This patch allows userspace stacks to
mount the fs again.

Cc: stable@kernel.org
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>


# 316873c9 01-Feb-2011 Tejun Heo <tj@kernel.org>

ocfs2: use system_wq instead of ocfs2_quota_wq

ocfs2_quota_wq is not depended upon during memory reclaim and, with
cmwq, there's no reason to use a dedicated workqueue. Drop
ocfs2_quota_wq and use system_wq instead. dqi_sync_work is already
sync canceled on quota disable and no further synchronization is
necessary.

This change makes ocfs2_quota_setup/shutdown() noops. Both functions
removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>


# ba87167c 17-Dec-2010 Al Viro <viro@zeniv.linux.org.uk>

switch ocfs2, close races

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f00c9e44 15-Sep-2010 Jan Kara <jack@suse.cz>

quota: Fix deadlock during path resolution

As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# fa0d7e3d 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 451a3c24 17-Nov-2010 Arnd Bergmann <arnd@arndb.de>

BKL: remove extraneous #include <smp_lock.h>

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 152a0836 24-Jul-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: mount_bdev()

... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bdb0d18 11-Oct-2010 Tristan Ye <tristan.ye@oracle.com>

ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes.

Currently, the default behavior of O_DIRECT writes was allowing
concurrent writing among nodes to the same file, with no cluster
coherency guaranteed (no EX lock held). This can leave stale data in
the cache for buffered reads on other nodes.

The new mount option introduce a chance to choose two different
behaviors for O_DIRECT writes:

* coherency=full, as the default value, will disallow
concurrent O_DIRECT writes by taking
EX locks.

* coherency=buffered, allow concurrent O_DIRECT writes
without EX lock among nodes, which
gains high performance at risk of
getting stale data on other nodes.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 75d9bbc7 10-Oct-2010 Goldwyn Rodrigues <rgoldwyn@gmail.com>

Initialize max_slots early

Functions such as ocfs2_recovery_init() make use of osb->max_slots.
Initialize osb->max_slots early so the functions may use the correct
value.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 2c442719 07-Oct-2010 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Add support for heartbeat=global mount option

Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>


# 98f486f2 09-Oct-2010 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO

OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.

This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.

This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>


# 60056794 24-Feb-2010 Arnd Bergmann <arnd@arndb.de>

BKL: Remove BKL from OCFS2

The BKL in ocfs2/dlmfs is used in put_super, fill_super and remount_fs
that are all three protected by the superblocks s_umount rw_semaphore.

The use in ocfs2_control_open is evidently unrelated and the function
is protected by ocfs2_control_lock.

Therefore it is safe to remove the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>


# db719222 15-Aug-2010 Jan Blunck <jblunck@infradead.org>

BKL: Explicitly add BKL around get_sb/fill_super

This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>


# b4d693fc 16-Aug-2010 Tao Ma <tao.ma@oracle.com>

ocfs2: Cache system inodes of other slots.

Durring orphan scan, if we are slot 0, and we are replaying
orphan_dir:0001, the general process is that for every file
in this dir:
1. we will iget orphan_dir:0001, since there is no inode for it.
we will have to create an inode and read it from the disk.
2. do the normal work, such as delete_inode and remove it from
the dir if it is allowed.
3. call iput orphan_dir:0001 when we are done. In this case,
since we have no dcache for this inode, i_count will
reach 0, and VFS will have to call clear_inode and in
ocfs2_clear_inode we will checkpoint the inode which will let
ocfs2_cmt and journald begin to work.
4. We loop back to 1 for the next file.

So you see, actually for every deleted file, we have to read the
orphan dir from the disk and checkpoint the journal. It is very
time consuming and cause a lot of journal checkpoint I/O.
A better solution is that we can have another reference for these
inodes in ocfs2_super. So if there is no other race among
nodes(which will let dlmglue to checkpoint the inode), for step 3,
clear_inode won't be called and for step 1, we may only need to
read the inode for the 1st time. This is a big win for us.

So this patch will try to cache system inodes of other slots so
that we will have one more reference for these inodes and avoid
the extra inode read and journal checkpoint.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 3bdb8efd 22-Jul-2010 Patrick J. LoPresti <lopresti@gmail.com>

OCFS2: Allow huge (> 16 TiB) volumes to mount

The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB. But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.

This patch replaces that sanity check with a more sophisticated one to
mount a huge volume provided that (a) it is addressable by the raw
word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.

I factored out the sanity check into its own function. I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.

This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").

I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal. All of them appear
to work or to fail gracefully, as appropriate.

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 066d92dc 08-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

convert ocfs2 to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 421f91d2 10-Jun-2010 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

fix typos concerning "initiali[zs]e"

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 287a8095 19-May-2010 Christoph Hellwig <hch@infradead.org>

quota: rename default quotactl methods to dquot_

Follow the dquot_* style used elsewhere in dquot.c.

[Jan Kara: Fixed up missing conversion of ext2]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 307ae18a 19-May-2010 Christoph Hellwig <hch@infradead.org>

quota: drop remount argument to ->quota_on and ->quota_off

Remount handling has fully moved into the filesystem, so all this is
superflous now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 0f0dd62f 19-May-2010 Christoph Hellwig <hch@infradead.org>

quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers

Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly. Also rename vfs_quota_disable to
dquot_disable while we're at it.

[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# eea7feb0 13-May-2010 Jan Kara <jack@suse.cz>

ocfs2: Fix use after free on remount read-only

We also have to cancel quota syncing thread on remount read only because
at that moment quota is being turned off. Otherwise quota syncing thread
will try to access already freed quota structures.

Signed-off-by: Jan Kara <jack@suse.cz>


# c06bcbfa 13-May-2010 Jan Kara <jack@suse.cz>

ocfs2: Fix lock inversion in quotas during umount

We cannot cancel delayed work from ocfs2_local_free_info because that is called
with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to
finish. Cancel the work before acquiring dqonoff_mutex.

Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# e4b963f1 02-Sep-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Wrap signal blocking in void functions.

ocfs2 sometimes needs to block signals around dlm operations, but it
currently does it with sigprocmask(). Even worse, it's checking the
error code of sigprocmask(). The in-kernel sigprocmask() can only error
if you get the SIG_* argument wrong. We don't.

Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These
functions are void, but they will BUG() if somehow sigprocmask() returns
an error.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 4b37fcb7 13-Apr-2010 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Make nointr a default mount option

OCFS2 has never really supported intr. This patch acknowledges this reality
and makes nointr the default mount option. In a later patch, we intend to
support intr.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 83f92318 05-Apr-2010 Mark Fasheh <mfasheh@suse.com>

ocfs2: Add dir_resv_level mount option

The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 6b82021b 05-Apr-2010 Mark Fasheh <mfasheh@suse.com>

ocfs2: increase the default size of local alloc windows

I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.

Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.

Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.

To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.

With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:

localalloc= avg. fragmentation
8 48
32 16
64 10
120 7

On larger cluster sizes, the difference is more dramatic.

The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 73c8a800 05-Apr-2010 Mark Fasheh <mfasheh@suse.com>

ocfs2: clean up localalloc mount option size parsing

This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:

ocfs2_local_alloc_size(sb) * 8

which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 4fe370af 07-Dec-2009 Mark Fasheh <mfasheh@suse.com>

ocfs2: use allocation reservations during file write

Add a per-inode reservations structure and pass it through to the
reservations code.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# d02f00cc 07-Dec-2009 Mark Fasheh <mfasheh@suse.com>

ocfs2: allocation reservations

This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.

Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 8571882c 13-Apr-2010 Tao Ma <tao.ma@oracle.com>

ocfs2: ocfs2_group_bitmap_size has to handle old volume.

ocfs2_group_bitmap_size has to handle the case when the
volume don't have discontiguous block group support. So
pass the feature_incompat in and check it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>


# 4cbe4249 13-Apr-2010 Joel Becker <joel.becker@oracle.com>

ocfs2: Define data structures for discontiguous block groups.

Defines the OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG feature bit and modifies
struct ocfs2_group_desc for the feature.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>


# b89c5428 24-Jan-2010 Tiger Yang <tiger.yang@oracle.com>

ocfs2: add extent block stealing for ocfs2 v5

This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 2bd63216 25-Jan-2010 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2/trivial: Remove trailing whitespaces

Patch removes trailing whitespaces.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 837711f8 16-Jan-2009 Coly Li <coly.li@suse.de>

ocfs2: return f_fsid info in ocfs2_statfs()

Currently the f_fsid of struct kstatfs returned from ocfs2_statfs() is
undefined (vfs layer fills in 0 as default). Since in some conditions,
f_fsid value might be used in a (f_fsid, ino) pair to uniquely identify
a file, ocfs2 should return a unique defined f_fsid value from
ocfs2_statfs().

Because uuid_str is the same on big or litlle endian machine, it's
endian consistent to use osb->uuid_str to generate f_fsid value.

Signed-off-by: Coly Li <coly.li@suse.de>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 57b09bb5 15-Oct-2009 Jan Kara <jack@suse.cz>

ocfs2: Set MS_POSIXACL on remount

We have to set MS_POSIXACL on remount as well. Otherwise VFS
would not know we started supporting ACLs after remount and
thus ACLs would not work.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 5297aad8 15-Oct-2009 Jan Kara <jack@suse.cz>

ocfs2: Make acl use the default

Change acl mount options handling to match the one of XFS and BTRFS and
hopefully it is also easier to use now. When admin does not specify any
acl mount option, acls are enabled if and only if the filesystem has
xattr feature enabled. If admin specifies 'acl' mount option, we fail
the mount if the filesystem does not have xattr feature and thus acls
cannot be enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# e6aabe0c 15-Oct-2009 Jan Kara <jack@suse.cz>

ocfs2: Always include ACL support

To become consistent with filesystems such as XFS or BTRFS, make posix
ACLs always available. This also reduces possibility of
misconfiguration on admin's side.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# fb5cbe9e 28-Oct-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Return -EINVAL when a device is not ocfs2.

In case of non-modular kernels the root filesystem is mounted by trying
several filesystems. If ocfs2 was tried before the actual filesystem
type, the mount would fail because ocfs2_sb_probe() returns -EAGAIN
instead of -EINVAL. ocfs2 will now return -EINVAL properly.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reported-by: Laszlo Attila Toth <panther@balabit.hu>


# 828c0950 01-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: constify remaining file_operations

[akpm@linux-foundation.org: fix KVM]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2bcd57ab 23-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

headers: utsname.h redux

* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it

NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a2f2ddbf 19-Aug-2009 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: __ocfs2_abort() should not enable panic for local mounts

In a clustered setup, we have to panic the box on journal abort. This is
because we don't have the facility to go hard readonly. With hard ro, another
node would detect node failure and initiate recovery.

Having said that, we shouldn't force panic if the volume is mounted locally.
This patch defers the handling to the mount option, errors.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 374a263e 23-Aug-2009 Tao Ma <tao.ma@oracle.com>

ocfs2: Add refcount tree lock mechanism.

Implement locking around struct ocfs2_refcount_tree. This protects
all read/write operations on refcount trees. ocfs2_refcount_tree
has its own lock and its own caching_info, protecting buffers among
multiple nodes.

User must call ocfs2_lock_refcount_tree before his operation on
the tree and unlock it after that.

ocfs2_refcount_trees are referenced by the block number of the
refcount tree root block, So we create an rb-tree on the ocfs2_super
to look them up.

Signed-off-by: Tao Ma <tao.ma@oracle.com>


# 0d54b217 21-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: make struct super_block::s_qcop const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 292dd27e 12-Feb-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: move ip_created_trans to struct ocfs2_caching_info

Similar ip_last_trans, ip_created_trans tracks the creation of a journal
managed inode. This specifically tracks what transaction created the
inode. This is so the code can know if the inode has ever been written
to disk.

This behavior is desirable for any journal managed object. We move it
to struct ocfs2_caching_info as ci_created_trans so that any object
using ocfs2_caching_info can rely on this behavior.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 66fb345d 12-Feb-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: move ip_last_trans to struct ocfs2_caching_info

We have the read side of metadata caching isolated to struct
ocfs2_caching_info, now we need the write side. This means the journal
functions. The journal only does a couple of things with struct inode.

This change moves the ip_last_trans field onto struct
ocfs2_caching_info as ci_last_trans. This field tells the journal
whether a pending journal flush is required.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 8cb471e8 10-Feb-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Take the inode out of the metadata read/write paths.

We are really passing the inode into the ocfs2_read/write_blocks()
functions to get at the metadata cache. This commit passes the cache
directly into the metadata block functions, divorcing them from the
inode.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 6e5a3d75 10-Feb-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Change metadata caching locks to an operations structure.

We don't really want to cart around too many new fields on the
ocfs2_caching_info structure. So let's wrap all our access of the
parent object in a set of operations. One pointer on caching_info, and
more flexibility to boot.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 47460d65 10-Feb-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Make the ocfs2_caching_info structure self-contained.

We want to use the ocfs2_caching_info structure in places that are not
inodes. To do that, it can no longer rely on referencing the inode
directly.

This patch moves the flags to ocfs2_caching_info->ci_flags, stores
pointers to the parent's locks on the ocfs2_caching_info, and renames
the constants and flags to reflect its independant state.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 5fd13189 30-Jul-2009 Jan Kara <jack@suse.cz>

ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

If we fail to mount the filesystem, we have to be careful not to dereference
uninitialized structures in ocfs2_kill_sb.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 1c1d9793 22-Jul-2009 Jan Kara <jack@suse.cz>

ocfs2: Fix initialization of blockcheck stats

We just set blockcheck stats to zeros but we should also
properly initialize the spinlock there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# f7b1aa69 19-Jul-2009 Jan Kara <jack@suse.cz>

ocfs2: Fix deadlock on umount

In commit ea455f8ab68338ba69f5d3362b342c115bea8e13, we moved the dentry lock
put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
can drop references to inodes while they are being invalidated by
invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
ending in an infinite loop, "Busy inodes after umount" messages etc.).

We fix the problem by stopping ocfs2_wq from doing any further releasing of
inode references on the superblock being unmounted, wait until it finishes
the current round of releasing and finally cleaning up all the references in
dentry_lock_list from ocfs2_put_super().

The issue was tracked down by Tao Ma <tao.ma@oracle.com>.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 8b712cd5 07-Jul-2009 Jeff Mahoney <jeffm@suse.com>

ocfs2: Fixup orphan scan cleanup after failed mount

If the mount fails for any reason, ocfs2_dismount_volume calls
ocfs2_orphan_scan_stop. It requires that ocfs2_orphan_scan_init
be called to setup the mutex and work queues, but that doesn't
happen if the mount has failed and we oops accessing an uninitialized
work queue.

This patch splits the init and startup of the orphan scan, eliminating
the oops.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# df152c24 22-Jun-2009 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Disable orphan scanning for local and hard-ro mounts

Local and Hard-RO mounts do not need orphan scanning.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 692684e1 19-Jun-2009 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Stop orphan scan as early as possible during umount

Currently if the orphan scan fires a tick before the user issues the umount,
the umount will wait for the queued orphan scan tasks to complete.

This patch makes the umount stop the orphan scan as early as possible so as
to reduce the probability of the queued tasks slowing down the umount.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# c3d38840 19-Jun-2009 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Fix ocfs2_osb_dump()

Skip printing information that is not valid for local mounts.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 90c699a9 19-Jun-2009 Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

block: rename CONFIG_LBD to CONFIG_LBDAF

Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 337eb00a 12-May-2009 Alessio Igor Bogani <abogani@texware.it>

Push BKL down into ->remount_fs()

[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6cfd0148 05-May-2009 Christoph Hellwig <hch@lst.de>

push BKL down into ->put_super

Move BKL into ->put_super from the only caller. A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 94cb993f 27-Apr-2009 Christoph Hellwig <hch@infradead.org>

ocfs2: remove ->write_super and stop maintaining ->s_dirt

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 06c59bb8 18-May-2009 Tao Ma <tao.ma@oracle.com>

ocfs2: Remove redundant gotos in ocfs2_mount_volume()

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 73be192b 06-Jan-2009 Joel Becker <joel.becker@oracle.com>

ocfs2: Add statistics for the checksum and ecc operations.

It would be nice to know how often we get checksum failures. Even
better, how many of them we can fix with the single bit ecc. So, we add
a statistics structure. The structure can be installed into debugfs
wherever the user wants.

For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions. The stats are only
registered with debugfs when the filesystem supports metadata ecc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 15633a22 03-Jun-2009 Srinivas Eeda <srinivas.eeda@oracle.com>

ocfs2 patch to track delayed orphan scan timer statistics

Patch to track delayed orphan scan timer statistics.

Modifies ocfs2_osb_dump to print the following:
Orphan Scan=> Local: 10 Global: 21 Last Scan: 67 seconds ago

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 83273932 03-Jun-2009 Srinivas Eeda <srinivas.eeda@oracle.com>

ocfs2: timer to queue scan of all orphan slots

When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock. When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert. The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.

This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>


# e1defc4f 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com>

block: Do away with the notion of hardsect_size

Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9140db04 06-Mar-2009 Srinivas Eeda <srinivas.eeda@oracle.com>

ocfs2: recover orphans in offline slots during recovery and mount

During recovery, a node recovers orphans in it's slot and the dead node(s). But
if the dead nodes were holding orphans in offline slots, they will be left
unrecovered.

If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers it's own slot, which
leaves orphans in offline slots.

This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 9b7895ef 12-Nov-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Add a name indexed b-tree to directory inodes

This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value,
and pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.

Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>


# 50397507 17-Dec-2008 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Expose the file system state via debugfs

This patch creates a per mount debugfs file, fs_state, which exposes
information like, cluster stack in use, states of the downconvert, recovery
and commit threads, number of journal txns, some allocation stats, list of
all slots, etc.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 28d57d43 12-Feb-2009 wengang wang <wen.gang.wang@oracle.com>

ocfs2: add IO error check in ocfs2_get_sector()

Check for IO error in ocfs2_get_sector().

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# c8b9cf9a 24-Feb-2009 Tao Ma <tao.ma@oracle.com>

ocfs2: lock the metaecc process for xattr bucket

For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. While for xattr bucket, it is calculated by
the whole buckets. So we have to add a spin_lock to prevent multiple
processes calculating metaecc.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# ea455f8a 12-Jan-2009 Jan Kara <jack@suse.cz>

ocfs2: Push out dropping of dentry lock to ocfs2_wq

Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# d030cc97 11-Dec-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Validate superblock with checksum and ecc.

The superblock is read via a raw call. Validate it after we find it
from its signature.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 19ece546 21-Aug-2008 Jan Kara <jack@suse.cz>

ocfs2: Enable quota accounting on mount, disable on umount

Enable quota usage tracking on mount and disable it on umount. Also
add support for quota on and quota off quotactls and usrquota and
grpquota mount options. Add quota features among supported ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 171bf93c 20-Oct-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Periodic quota syncing

This patch creates a work queue for periodic syncing of locally cached quota
information to the global quota files. We constantly queue a delayed work
item, to get the periodic behavior.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>


# 9e33d69f 25-Aug-2008 Jan Kara <jack@suse.cz>

ocfs2: Implementation of local and global quota file handling

For each quota type each node has local quota file. In this file it stores
changes users have made to disk usage via this node. Once in a while this
information is synced to global file (and thus with other nodes) so that
limits enforcement at least aproximately works.

Global quota files contain all the information about usage and limits. It's
mostly handled by the generic VFS code (which implements a trie of structures
inside a quota file). We only have to provide functions to convert structures
from on-disk format to in-memory one. We also have to provide wrappers for
various quota functions starting transactions and acquiring necessary cluster
locks before the actual IO is really started.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 1a224ad1 20-Aug-2008 Jan Kara <jack@suse.cz>

ocfs2: Assign feature bits and system inodes to quota feature and quota files

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# a68979b8 13-Nov-2008 Tiger Yang <tiger.yang@oracle.com>

ocfs2: add mount option and Kconfig option for acl

This patch adds the Kconfig option "CONFIG_OCFS2_FS_POSIX_ACL"
and mount options "acl" to enable acls in Ocfs2.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# a81cb88b 07-Oct-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Don't check for NULL before brelse()

This is pointless as brelse() already does the check.

Signed-off-by: Mark Fasheh


# b0f73cfc 05-Sep-2008 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Add xattr mount option in ocfs2_show_options()

Patch adds check for [no]user_xattr in ocfs2_show_options() that completes
the list of all mount options.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 2b4e30fb 03-Sep-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Switch over to JBD2.

ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.

It's a pretty trivial change. Most functions are just renamed. The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.

Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem. It can even interact with JBD-based ocfs2 as long
as the journal is formated for JBD.

We provide a compatibility option so that paranoid people can still use
JBD for the time being. This will go away shortly.

[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
ocfs2_truncate_for_delete(). --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 12462f1d 03-Sep-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Add the 'inode64' mount option.

Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit. This parallels XFS. 64bit systems can handle the
larger inode numbers.

[ Added description of inode64 mount option in ocfs2.txt. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 8154da3d 18-Aug-2008 Tiger Yang <tiger.yang@oracle.com>

ocfs2: Add incompatible flag for extended attribute

This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# cf1d6c76 18-Aug-2008 Tiger Yang <tiger.yang@oracle.com>

ocfs2: Add extended attribute support

This patch implements storing extended attributes both in inode or a single
external block. We only store EA's in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# fdd77704 18-Aug-2008 Tiger Yang <tiger.yang@oracle.com>

ocfs2: reserve inline space for extended attribute

Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 9c7af40b 28-Jul-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: throttle back local alloc when low on disk space

Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink it's window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.

To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.

Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator is has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# ebcee4b5 28-Jul-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Track local alloc bits internally

Do this instead of tracking absolute local alloc size. This avoids
needless re-calculatiion of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# a447c093 13-Oct-2008 Steven Whitehouse <swhiteho@redhat.com>

vfs: Use const for kernel parser table

This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.

This was posted for review some time ago and I believe its been in -mm
since then.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 539d8264 14-Jul-2008 Sunil Mushran <sunil.mushran@oracle.com>

[PATCH 2/2] ocfs2: Fix race between mount and recovery

As the fs recovery is asynchronous, there is a small chance that another
node can mount (and thus recover) the slot before the recovery thread
gets to it.

If this happens, the recovery thread will block indefinitely on the
journal/slot lock as that lock will be held for the duration of the mount
(by design) by the node assigned to that slot.

The solution implemented is to keep track of the journal replays using
a recovery generation in the journal inode, which will be incremented by the
thread replaying that journal. The recovery thread, before attempting the
blocking lock on the journal/slot lock, will compare the generation on disk
with what it has cached and skip recovery if it does not match.

This bug appears to have been inadvertently introduced during the mount/umount
vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the
mount voting scheme, the messaging would indirectly indicate that the slot
was being recovered.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 51cc5068 25-Jul-2008 Alexey Dobriyan <adobriyan@gmail.com>

SL*B: drop kmem cache argument from constructor

Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01af4820 10-Jun-2008 Wengang Wang <wen.gang.wang@oracle.com>

ocfs2: Handle error during journal load

This patch ensures the mount fails if the fs is unable to load the journal.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 4d0ddb2c 05-Mar-2008 Tao Ma <tao.ma@oracle.com>

ocfs2: Add inode stealing for ocfs2_reserve_new_inode

Inode allocation is modified to look in other nodes allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# b61817e1 01-Feb-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Add the USERSPACE_STACK incompat bit.

The filesystem gains the USERSPACE_STACK incomat bit and the
s_cluster_info field on the superblock. When a userspace stack is in
use, the name of the stack is stored on-disk for mount-time
verification.

The "cluster_stack" option is added to mount(2) processing. The mount
process needs to pass the matching stack name. If the passed name and
the on-disk name do not match, the mount is failed.

When using the classic o2cb stack, the incompat bit is *not* set and no
mount option is used other than the usual heartbeat=local. Thus, the
filesystem is compatible with older tools.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 286eaa95 01-Feb-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Break out stackglue into modules.

We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.

The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().

ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.

Signed-off-by: Joel Becker <joel.becker@oracle.com>


# 63e0c48a 30-Jan-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Clean up stackglue initialization

The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 0abd6d18 29-Jan-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Fill node number during cluster stack init

It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed,
and we're actually only moving the setting of node num a small amount
further in the mount process.

[ Disconnect when node query fails -- Joel ]

Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 6953b4c0 29-Jan-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Move o2hb functionality into the stack glue.

The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.

We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.

The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.

With this change, only stackglue.c includes cluster/ headers.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 19fdb624 30-Jan-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Abstract out node number queries.

ocfs2 asks the cluster stack for the local node's node number for two
reasons; to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.

[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
truly opaque. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 4670c46d 01-Feb-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Introduce the new ocfs2_cluster_connect/disconnect() API.

This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.

heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.

The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 24ef1815 29-Jan-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Separate out dlm lock functions.

This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 553abd04 01-Feb-2008 Joel Becker <joel.becker@oracle.com>

ocfs2: Change the recovery map to an array of node numbers.

The old recovery map was a bitmap of node numbers. This was sufficient
for the maximum node number of 254. Going forward, we want node numbers
to be UINT32. Thus, we need a new recovery map.

Note that we can't keep track of slots here. We must write down the
node number to recovery *before* we get the locks needed to convert a
node number into a slot number.

The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.

Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init(). This actually cleans up
ocfs2_initialize_super() a little as well. Following on, recovery cleaup
becomes part of ocfs2_recovery_exit().

A number of node map functions are rendered obsolete and are removed.

Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event. This is a cleanup from Mark.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# 8e8a4603 01-Feb-2008 Mark Fasheh <mfasheh@suse.com>

ocfs2: Move slot map access into slot_map.c

journal.c and dlmglue.c would refresh the slot map by hand. Instead, have
the update and clear functions do the work inside slot_map.c. The eventual
result is to make ocfs2_slot_info defined privately in slot_map.c

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>


# d24fbcda 25-Jan-2008 Joel Becker <Joel.Becker@oracle.com>

ocfs2: Negotiate locking protocol versions.

Currently, when ocfs2 nodes connect via TCP, they advertise their
compatibility level. If the versions do not match, two nodes cannot speak
to each other and they disconnect. As a result, this provides no forward or
backwards compatibility.

This patch implements a simple protocol negotiation at the dlm level by
introducing a major/minor version number scheme for entities that
communicate. Specifically, o2dlm has a major/minor version for interaction
with o2dlm on other nodes, and ocfs2 itself has a major/minor version for
interacting with the filesystem on other nodes.

This will allow rolling upgrades of ocfs2 clusters when changes to the
locking or network protocols can be done in a backwards compatible manner.
In those cases, only the minor number is changed and the negotatied protocol
minor is returned from dlm join. In the far less likely event that a
required protocol change makes backwards compatibility impossible, we simply
bump the major number.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 5fa0613e 10-Jan-2008 Jan Kara <jack@suse.cz>

ocfs2: Silence false lockdep warnings

Create separate lockdep lock classes for system file's i_mutexes. They are
used to guard allocations and similar things and thus rank differently
than i_mutex of a regular file or directory.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 53fc622b 20-Dec-2007 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH 2/2] ocfs2: cluster aware flock()

Hook up ocfs2_flock(), using the new flock lock type in dlmglue.c. A new
mount option, "localflocks" is added so that users can revert to old
functionality as need be.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 2fbe8d1e 20-Dec-2007 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Local alloc window size changeable via mount option

Local alloc is a performance optimization in ocfs2 in which a node
takes a window of bits from the global bitmap and then uses that for
all small local allocations. This window size is fixed to 8MB currently.
This patch allows users to specify the window size in MB including
disabling it by passing in 0. If the number specified is too large,
the fs will use the default value of 8MB.

mount -o localalloc=X /dev/sdX /mntpoint

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# d147b3d6 07-Nov-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Support commit= mount option

Mostly taken from ext3. This allows the user to set the jbd commit interval,
in seconds. The default of 5 seconds stays the same, but now users can
easily increase the commit interval. Typically, this would be increased in
order to benefit performance at the expense of data-safety.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# e9d578a8 18-Dec-2007 Tao Ma <tao.ma@oracle.com>

ocfs2: Initalize bitmap_cpg of ocfs2_super to be the maximum.

This value is initialized from global_bitmap->id2.i_chain.cl_cpg. If there
is only 1 group, it will be equal to the total clusters in the volume. So
as for online resize, it should change for all the nodes in the cluster.
It isn't easy and there is no corresponding lock for it.

bitmap_cpg is only used in 2 areas:
1. Check whether the suballoc is too large for us to allocate from the global
bitmap, so it is little used. And now the suballoc size is 2048, it rarely
meet this situation and the check is almost useless.
2. Calculate which group a cluster belongs to. We use it during truncate to
figure out which cluster group an extent belongs too. But we should be OK
if we increase it though as the cluster group calculated shouldn't change
and we only ever have a small bitmap_cpg on file systems with a single
cluster group.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# e63aecb6 18-Oct-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Rename ocfs2_meta_[un]lock

Call this the "inode_lock" now, since it covers both data and meta data.
This patch makes no functional changes.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# c934a92d 18-Oct-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Remove data locks

The meta lock now covers both meta data and data, so this just removes the
now-redundant data lock.

Combining locks saves us a round of lock mastery per inode and one less lock
to ping between nodes during read/write.

We don't lose much - since meta locks were always held before a data lock
(and at the same level) ordered writeout mode (the default) ensured that
flushing for the meta data lock also pushed out data anyways.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 34d024f8 24-Sep-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Remove mount/unmount votes

The node maps that are set/unset by these votes are no longer relevant, thus
we can remove the mount and umount votes. Since those are the last two
remaining votes, we can also remove the entire vote infrastructure.

The vote thread has been renamed to the downconvert thread, and the small
amount of functionality related to managing it has been moved into
fs/ocfs2/dlmglue.c. All references to votes have been removed or updated.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 6f7b056e 24-Sep-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Remove fs dependency on ocfs2_heartbeat module

Now that the dlm exposes domain information to us, we don't need generic
node up / node down callbacks. And since the DLM is only telling us when a
node goes down unexpectedly, we no longer need to optimize away node down
callbacks via the umount map.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# e001e796 07-Nov-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Reset journal parameters after s_mount_opt update

Right now we're just setting them from the existing parameters, not the
new ones that a remount specified.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 4ba9b9d0 17-Oct-2007 Christoph Lameter <clameter@sgi.com>

Slab API: remove useless ctor parameter and reorder parameters

Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.

Convert

ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bddb8eb3 26-Sep-2007 Denis Cheng <crquan@gmail.com>

[PATCH] fs/ocfs2/: removed unneeded initial value and function's return value

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# d550071c 06-Sep-2007 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Implement show_options()

Implement sops->show_options() so as to allow /proc/mounts to show the mount
options.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 19b613d4 04-Oct-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Clear slot map when umounting a local volume

This is technically harmless (recovery will clean it out later), but leaves
a bogus entry in the slot_map which really shouldn't be there.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# c0123ade 07-Sep-2007 Tiger Yang <tiger.yang@oracle.com>

[PATCH] ocfs2: fix mount option parsing

For some mount option types, ocfs2_parse_options() will try to access
sb->s_fs_info to get at the ocfs2 private superblock. Unfortunately, that
hasn't been allocated yet and will cause a kernel crash.

Fix this by storing options in a struct which can then get pushed into the
ocfs2_super once it's been allocated later. If we need more options which
store to the ocfs2_super in the future, we can just fields to this struct.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# e0dceaf0 09-Aug-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: set non-default s_time_gran during mount

We need to manually set this to '1' during mount, otherwise inode_setattr()
will chop off the nanosecond portion of our timestamps.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 5a254031 20-Jul-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Fix max offset calculations

ocfs2_max_file_offset() was over-estimating the largest file size for
several cases. This wasn't really a problem before, but now that we support
sparse files, it needs to be more accurate.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# a00cce35 20-Jul-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: use s_maxbytes directly in ocfs2_change_file_space()

There's no need to recalculate things via ocfs2_max_file_offset() as we've
already done that to fill s_maxbytes, so use that instead. We can also
un-export ocfs2_max_file_offset() then.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 20c2df83 19-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>


# b2580103 09-Mar-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Support xfs style space reservation ioctls

We re-use the RESVSP/UNRESVSP ioctls from xfs which allow the user to
allocate and deallocate regions to a file without zeroing data or changing
i_size.

Though renamed, the structure passed in from user is identical to struct
xfs_flock64. The three fields that are actually used right now are l_whence,
l_start and l_len.

This should get ocfs2 immediate compatibility with userspace software using
the pre-existing xfs ioctls.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# baf4661a 18-Jun-2007 Sunil Mushran <Sunil.Mushran@oracle.com>

ocfs2: Add "preferred slot" mount option

ocfs2 will attempt to assign the node the slot# provided in the mount
option. Failure to assign the preferred slot is not an error. This small
feature can be useful for automated testing.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# a35afb83 16-May-2007 Christoph Lameter <clameter@sgi.com>

Remove SLAB_CTOR_CONSTRUCTOR

SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 50953fe9 06-May-2007 Christoph Lameter <clameter@sgi.com>

slab allocators: Remove SLAB_DEBUG_INITIAL flag

I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again? The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free. That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on. If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code. But there is no such code
in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e. add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches. Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support. Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1ca1a111 27-Apr-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: fix sparse warnings in fs/ocfs2

None of these are actually harmful, but the noise makes looking for real
problems difficult.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 83418978 23-Apr-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Cache extent records

The extent map code was ripped out earlier because of an inability to deal
with holes. This patch adds back a simpler caching scheme requiring far less
code.

Our old extent map caching was designed back when meta data block caching in
Ocfs2 didn't work very well, resulting in many disk reads. These days our
metadata caching is much better, resulting in no un-necessary disk reads. As
a result, extent caching doesn't have to be as fancy, nor does it have to
cache as many extents. Keeping the last 3 extents seen should be sufficient
to give us a small performance boost on some streaming workloads.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 363041a5 17-Jan-2007 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: temporarily remove extent map caching

The code in extent_map.c is not prepared to deal with a subtree being
rotated between lookups. This can happen when filling holes in sparse files.
Instead of a lengthy patch to update the code (which would likely lose the
benefit of caching subtree roots), we remove most of the algorithms and
implement a simple path based lookup. A less ambitious extent caching scheme
will be added in a later patch.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 50008630 20-Mar-2007 Tiger Yang <tiger.yang@oracle.com>

ocfs2: Remove delete inode vote

Ocfs2 currently does cluster-wide node messaging to check the open state of
an inode during delete. This patch removes that mechanism in favor of an
inode cluster lock which is taken at shared read when an inode is first read
and dropped in clear_inode(). This allows a deleting node to test the
liveness of an inode by attempting to take an exclusive lock.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# ee9b6d61 12-Feb-2007 Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>

[PATCH] Mark struct super_operations const

This patch is inspired by Arjan's "Patch series to mark struct
file_operations and struct inode_operations const".

Compile tested with gcc & sparse.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd861280 13-Dec-2006 Robert P. J. Day <rpjday@mindspring.com>

[PATCH] Fix numerous kcalloc() calls, convert to kzalloc()

All kcalloc() calls of the form "kcalloc(1,...)" are converted to the
equivalent kzalloc() calls, and a few kcalloc() calls with the incorrect
ordering of the first two arguments are fixed.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c271c5c2 05-Dec-2006 Sunil Mushran <Sunil.Mushran@oracle.com>

ocfs2: local mounts

This allows users to format an ocfs2 file system with a special flag,
OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT. When the file system sees this flag, it
will not use any cluster services, nor will it require a cluster
configuration, thus acting like a 'local' file system.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 4a6e617a 06-Dec-2006 Alexey Dobriyan <adobriyan@gmail.com>

[PATCH] fs/*: trivial vsnprintf() conversion

It would very lame to get buffer overflow via one of the following.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e18b890b 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e6b4f8da 06-Dec-2006 Christoph Lameter <clameter@sgi.com>

[PATCH] slab: remove SLAB_NOFS

SLAB_NOFS is an alias of GFP_NOFS.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 7f1a37e3 15-Nov-2006 Tiger Yang <tiger.yang@oracle.com>

ocfs2: core atime update functions

This patch adds the core routines for updating atime in ocfs2.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 4bcec184 09-Oct-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: remove unused handle argument from ocfs2_meta_lock_full()

Now that this is unused and all callers pass NULL, we can safely remove it.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# daf29e9c 06-Oct-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: remove unused ocfs2_handle_add_lock()

This gets us rid of a slab we no longer need, as well as removing the
majority of what's left on ocfs2_journal_handle.

ocfs2_commit_unstarted_handle() has no more real work to do, so remove that
function too.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 02928a71 06-Oct-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: remove unused ocfs2_handle_add_inode()

We can also delete the unused infrastructure which was once in place to
support this functionality. ocfs2_inode_private loses ip_handle and
ip_handle_list. ocfs2_journal_handle loses handle_list.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# c4028958 22-Nov-2006 David Howells <dhowells@redhat.com>

WorkStruct: make allyesconfig

Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>


# 2ecd05ae 11-Oct-2006 Alexey Dobriyan <adobriyan@gmail.com>

[PATCH] fs/*: use BUILD_BUG_ON

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 24c19ef4 22-Sep-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Remove i_generation from inode lock names

OCFS2 puts inode meta data in the "lock value block" provided by the DLM.
Typically, i_generation is encoded in the lock name so that a deleted inode
on and a new one in the same block don't share the same lvb.

Unfortunately, that scheme means that the read in ocfs2_read_locked_inode()
is potentially thrown away as soon as the meta data lock is taken - we
cannot encode the lock name without first knowing i_generation, which
requires a disk read.

This patch encodes i_generation in the inode meta data lvb, and removes the
value from the inode meta data lock name. This way, the read can be covered
by a lock, and at the same time we can distinguish between an up to date and
a stale LVB.

This will help cold-cache stat(2) performance in particular.

Since this patch changes the protocol version, we take the opportunity to do
a minor re-organization of two of the LVB fields.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 1ba9da2f 08-Sep-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: manually d_move() during ocfs2_rename()

Make use of FS_RENAME_DOES_D_MOVE to avoid a race condition that can occur
during ->rename() if we d_move() outside of the parent directory cluster
locks, and another node discovers the new name (created during the rename)
and unlinks it. d_move() will unconditionally rehash a dentry - which will
leave stale data in the system.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 101ebf25 02-May-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: limit cluster bitmap information saved at mount

We were storing cluster count on the ocfs2_super structure, but never
actually using it so remove that. Also, we don't want to populate the
uptodate cache with the unlocked block read - it is technically safe as is,
but we should change it for correctness.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 78427043 04-May-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: clean up some osb fields

Get rid of osb->uuid, osb->proc_sub_dir, and osb->osb_id. Those fields were
unused, or could easily be removed. As a result, we also no longer need
MAX_OSB_ID or ocfs2_globals_lock.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# a75a6e4c 04-May-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: fix init of uuid_net_key

ocfs2_initialize_super() should be copying from the beginning of the uuid.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 781ee3e2 27-Apr-2006 Sunil Mushran <sunil.mushran@oracle.com>

ocfs2: Cleanup message prints

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 726c3342 23-Jun-2006 David Howells <dhowells@redhat.com>

[PATCH] VFS: Permit filesystem to perform statfs with a known root dentry

Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.

This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.

linux/mount.h has been added where necessary to make allyesconfig build
successfully.

Interest has also been expressed for use with the FUSE and XFS filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 454e2398 23-Jun-2006 David Howells <dhowells@redhat.com>

[PATCH] VFS: Permit filesystem to override root dentry on mount

Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.

(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().

(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().

This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.

However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.

[*] Anonymous until discovered from another tree.

(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b0697053 03-Mar-2006 Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: don't use MLF* in the file system

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# fffb60f9 24-Mar-2006 Paul Jackson <pj@sgi.com>

[PATCH] cpuset memory spread: slab cache format

Rewrap the overly long source code lines resulting from the previous
patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch
contains only formatting changes, and no function change.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 4b6a9316 24-Mar-2006 Paul Jackson <pj@sgi.com>

[PATCH] cpuset memory spread: slab cache filesystems

Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
memory spreading.

If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
from such a slab cache, the allocations are spread evenly over all the
memory nodes (task->mems_allowed) allowed to that task, instead of favoring
allocation on the node local to the current cpu.

The following inode and similar caches are marked SLAB_MEM_SPREAD:

file cache
==== =====
fs/adfs/super.c adfs_inode_cache
fs/affs/super.c affs_inode_cache
fs/befs/linuxvfs.c befs_inode_cache
fs/bfs/inode.c bfs_inode_cache
fs/block_dev.c bdev_cache
fs/cifs/cifsfs.c cifs_inode_cache
fs/coda/inode.c coda_inode_cache
fs/dquot.c dquot
fs/efs/super.c efs_inode_cache
fs/ext2/super.c ext2_inode_cache
fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr
fs/ext3/super.c ext3_inode_cache
fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr
fs/fat/cache.c fat_cache
fs/fat/inode.c fat_inode_cache
fs/freevxfs/vxfs_super.c vxfs_inode
fs/hpfs/super.c hpfs_inode_cache
fs/isofs/inode.c isofs_inode_cache
fs/jffs/inode-v23.c jffs_fm
fs/jffs2/super.c jffs2_i
fs/jfs/super.c jfs_ip
fs/minix/inode.c minix_inode_cache
fs/ncpfs/inode.c ncp_inode_cache
fs/nfs/direct.c nfs_direct_cache
fs/nfs/inode.c nfs_inode_cache
fs/ntfs/super.c ntfs_big_inode_cache_name
fs/ntfs/super.c ntfs_inode_cache
fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache
fs/ocfs2/super.c ocfs2_inode_cache
fs/proc/inode.c proc_inode_cache
fs/qnx4/inode.c qnx4_inode_cache
fs/reiserfs/super.c reiser_inode_cache
fs/romfs/inode.c romfs_inode_cache
fs/smbfs/inode.c smb_inode_cache
fs/sysv/inode.c sysv_inode_cache
fs/udf/super.c udf_inode_cache
fs/ufs/super.c ufs_inode_cache
net/socket.c sock_inode_cache
net/sunrpc/rpc_pipe.c rpc_inode_cache

The choice of which slab caches to so mark was quite simple. I marked
those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
inode_cache, and buffer_head, which were marked in a previous patch. Even
though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
potentially large file system i/o related slab caches as we need for memory
spreading.

Given that the rule now becomes "wherever you would have used a
SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
Future file system writers will just copy one of the existing file system
slab cache setups and tend to get it right without thinking.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ac2b898c 22-Mar-2006 Christoph Lameter <clameter@engr.sgi.com>

[PATCH] slab: Remove SLAB_NO_REAP option

SLAB_NO_REAP is documented as an option that will cause this slab not to be
reaped under memory pressure. However, that is not what happens. The only
thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused
slab elements that were allocated in batch in cache_reap(). Cache_reap()
is run every few seconds independently of memory pressure.

Could we remove the whole thing? Its only used by three slabs anyways and
I cannot find a reason for having this option.

There is an additional problem with SLAB_NO_REAP. If set then the recovery
of objects from alien caches is switched off. Objects not freed on the
same node where they were initially allocated will only be reused if a
certain amount of objects accumulates from one alien node (not very likely)
or if the cache is explicitly shrunk. (Strangely __cache_shrink does not
check for SLAB_NO_REAP)

Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# b4df6ed8 22-Feb-2006 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] ocfs2: fix orphan recovery deadlock

Orphan dir recovery can deadlock with another process in
ocfs2_delete_inode() in some corner cases. Fix this by tracking recovery
state more closely and allowing it to handle inode wipes which might
deadlock.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# ebdec83b 27-Jan-2006 Eric Sesterhenn / snakebyte <snakebyte@gmx.de>

[PATCH] BUG_ON() Conversion in fs/ocfs2/

this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# c74ec2f7 13-Jan-2006 Arjan van de Ven <arjan@infradead.org>

[PATCH] ocfs2: Semaphore to mutex conversion.

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 251b6ecc 10-Jan-2006 Mark Fasheh <mark.fasheh@oracle.com>

[OCFS2] Make ip_io_sem a mutex

ip_io_sem is now ip_io_mutex.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


# 7892f2f4 09-Jan-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] mutex subsystem, semaphore to mutex: VFS, sb->s_lock

This patch converts the superblock-lock semaphore to a mutex, affecting
lock_super()/unlock_super(). Tested on ext3 and XFS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>


# ccd979bd 15-Dec-2005 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] OCFS2: The Second Oracle Cluster Filesystem

The OCFS2 file system module.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>