History log of /linux-master/fs/bcachefs/journal_reclaim.c
Revision Date Author Comments
# 6088234c 05-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: JOURNAL_SPACE_LOW

"bcachefs: Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.

The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottlenecked on write buffer flushing - we
don't want kicking off btree updates to depend on the state of the write
buffer.

This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
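
A minimal sketch of the distinction drawn in 6088234c, assuming a simplified journal state. The struct, the function names and the "one quarter free" threshold are hypothetical and are not taken from the bcachefs source; the point is only that the space-only check ignores the write buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified journal state; not the real struct journal. */
struct journal_space_sketch {
	uint64_t total_sectors;     /* usable journal space */
	uint64_t free_sectors;      /* journal space currently free */
	bool     write_buffer_full; /* btree write buffer close to full */
};

/* Space-only check: true when at most 1/4 of the journal is free (assumed threshold). */
bool journal_space_low_sketch(const struct journal_space_sketch *j)
{
	assert(j->free_sectors <= j->total_sectors);
	return j->free_sectors * 4 <= j->total_sectors;
}

/* Combined check: also trips when the write buffer is full. */
bool journal_watermark_raised_sketch(const struct journal_space_sketch *j)
{
	return journal_space_low_sketch(j) || j->write_buffer_full;
}
```

With a full write buffer but plenty of journal space, only the combined check trips, so a caller that consults the space-only check is not throttled by write buffer flushing.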


# f1ca1abf 13-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: pull out time_stats.[ch]

prep work for lifting out of fs/bcachefs/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4f70176c 31-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill unnecessary wakeups in journal reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 097471f9 17-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_journal_flush_device_pins()

If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4e074475 10-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clamp replicas_required to replicas

This prevents going emergency read only when the user has specified
replicas_required > replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
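
The clamp in 4e074475 amounts to taking a minimum. The helper below is a hypothetical illustration, not the code from the patch:

```c
#include <assert.h>

/* Hypothetical helper: never require more replicas than are configured. */
unsigned clamp_replicas_required_sketch(unsigned replicas_required, unsigned replicas)
{
	assert(replicas >= 1);
	return replicas_required < replicas ? replicas_required : replicas;
}
```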


# 41b84fb4 17-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: for_each_member_device_rcu() now declares loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9fea2274 16-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: for_each_member_device() now declares loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cf904c8d 16-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch_err_(fn|msg) check if should print

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09caeabe 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree write buffer now slurps keys from journal

Previously, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for; if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0ba9375a 07-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Unwritten journal buffers are always dirty

Ensure that journal bufs that haven't been written can't be reclaimed
from the journal pin fifo, and can thus have new pins taken.

Prep work for changing the btree write buffer to pull keys from the
journal directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 066a2646 09-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: track_event_change()

This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.

We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.

Also do some cleanup and improvements on the time stats code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3eedfe1a 09-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Journal pins must always have a flush_fn

flush_fn is how we identify journal pins in debugfs - this is a
debugging aid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# df8e13cc 06-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add an assertion in bch2_journal_pin_set()

Previously, bch2_journal_pin_set() would silently ignore a request to
pin a journal sequence number that was no longer dirty, because it was
used internally by bch2_journal_pin_copy() which could race with the src
pin being flushed.

Split these apart so that we can properly assert that @seq is a
currently dirty journal sequence number - this is almost always a bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
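
A sketch of the split described in df8e13cc, under simplified assumptions: the dirty range is modelled as [last_seq, cur_seq], and all type and function names are hypothetical. The set path asserts that the sequence number is dirty; the copy path tolerates a source pin that was flushed in the meantime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: sequence numbers in [last_seq, cur_seq] are dirty. */
struct jseq_sketch { uint64_t last_seq, cur_seq; };
struct pin_sketch  { uint64_t seq; bool active; };

bool seq_is_dirty_sketch(const struct jseq_sketch *j, uint64_t seq)
{
	return seq >= j->last_seq && seq <= j->cur_seq;
}

/* Set path: pinning a sequence number that is no longer dirty is treated as a bug. */
void pin_set_sketch(const struct jseq_sketch *j, struct pin_sketch *pin, uint64_t seq)
{
	assert(seq_is_dirty_sketch(j, seq));
	pin->seq = seq;
	pin->active = true;
}

/* Copy path: the source may already have been flushed, so do nothing in that case. */
bool pin_copy_sketch(const struct jseq_sketch *j, struct pin_sketch *dst,
		     const struct pin_sketch *src)
{
	if (!src->active || !seq_is_dirty_sketch(j, src->seq))
		return false;
	dst->seq = src->seq;
	dst->active = true;
	return true;
}
```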


# a66ff26b 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Close journal entry if necessary when flushing all pins

Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 006ccc30 04-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill journal pre-reservations

This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
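
The watermark policy described in 006ccc30 can be sketched as follows. The enum and function names are hypothetical; only the "more than half" and "more than 3/4" thresholds come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watermark levels for this sketch. */
enum wm_sketch {
	WM_SKETCH_NORMAL,       /* at most half full */
	WM_SKETCH_RECLAIM_HARD, /* more than half full: run reclaim more aggressively */
	WM_SKETCH_RECLAIM_ONLY, /* more than 3/4 full: only reclaim gets reservations */
};

enum wm_sketch journal_watermark_sketch(uint64_t used, uint64_t total)
{
	assert(used <= total);
	assert(total <= UINT64_MAX / 4); /* keeps the multiplications below from overflowing */

	if (used * 4 > total * 3)
		return WM_SKETCH_RECLAIM_ONLY;
	if (used * 2 > total)
		return WM_SKETCH_RECLAIM_HARD;
	return WM_SKETCH_NORMAL;
}

/* Journal reclaim may always reserve; everyone else is blocked at the top watermark. */
bool may_get_reservation_sketch(enum wm_sketch wm, bool is_reclaim)
{
	return is_reclaim || wm != WM_SKETCH_RECLAIM_ONLY;
}
```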


# 92b63f5b 15-Sep-2023 Brian Foster <bfoster@redhat.com>

bcachefs: refactor pin put helpers

We have a couple journal pin put helpers to handle cases where the
journal lock is already held or not. Refactor the helpers to lock
and reclaim from the highest level and open code the reclaim from
the one caller of the internal variant. The latter call will be
moved into the journal buf release helper in a later patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e46c181a 10-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert more code to bch_err_msg()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fb8e5b4c 05-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: sb-members.c

Split out a new file for bch_sb_field_members - we'll likely want to
move more code here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1e81f89b 06-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix assorted checkpatch nits

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9a644843 08-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix error path in bch2_journal_flush_device_pins()

We need to always call bch2_replicas_gc_end() after we've called
bch2_replicas_gc_start(), else we leave state around that needs to be
cleaned up.

Partial fix for: https://github.com/koverstreet/bcachefs/issues/560

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
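
The start/end pairing that 9a644843 enforces is the usual single-exit error path. The sketch below uses hypothetical stand-ins (gc_start_sketch(), gc_end_sketch() and two injected step results), not the real bch2_replicas_gc_start()/bch2_replicas_gc_end() signatures:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the replicas gc state. */
struct gc_sketch { bool active; };

int gc_start_sketch(struct gc_sketch *gc)
{
	assert(!gc->active);
	gc->active = true;
	return 0;
}

void gc_end_sketch(struct gc_sketch *gc)
{
	gc->active = false;
}

/* step1_ret/step2_ret stand in for operations that may fail between start and end. */
int flush_device_pins_sketch(struct gc_sketch *gc, int step1_ret, int step2_ret)
{
	int ret = gc_start_sketch(gc);
	if (ret)
		return ret; /* start failed: nothing to clean up */

	ret = step1_ret;
	if (ret)
		goto err;

	ret = step2_ret;
err:
	gc_end_sketch(gc); /* always reached once start succeeded */
	return ret;
}
```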


# 73bd774d 06-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted sparse fixes

- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d14bfd10 30-Jun-2023 Brian Foster <bfoster@redhat.com>

bcachefs: mark active journal devices on journal replicas gc

A simple device evacuate, remove, add test loop with concurrent
shutdowns occasionally reproduces a problem where the filesystem
fails to mount. The mount failure occurs because the filesystem was
uncleanly shut down, yet no member device is marked for journal data
in the superblock. An fsck detects the problem, restores the mark
and allows the mount to proceed without further consistency issues.

The reason for the lack of journal data marks is the gc mechanism
invoked via bch2_journal_flush_device_pins() runs while the journal
happens to be empty. This results in garbage collection of all journal
replicas entries. Once the updated replicas table is written to the
superblock, the filesystem is put in a transiently unrecoverable state
until further journal data is written, because journal recovery expects
to find at least one marked journal device whenever the filesystem is
not otherwise marked clean (i.e. as on clean unmount).

To fix this problem, update the journal replicas gc algorithm to always
mark currently active journal replicas entries by writing to the
journal. This ensures that only entries for devices that are no longer
used for journaling are garbage collected, not just those that don't
happen to currently hold journal data. This preserves the journal
recovery invariant above and avoids putting the fs into a transiently
unrecoverable state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 19c304be 28-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: GFP_NOIO -> GFP_NOFS

GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 030e9f92 21-Mar-2023 Brian Foster <bfoster@redhat.com>

bcachefs: drop unnecessary journal stuck check from space calculation

The journal stuck check in bch2_journal_space_available() is
particularly aggressive and can lead to premature shutdown in some
rare cases. This is difficult to reproduce, but it comes with a fatal
error, so it is worth being cautious.

For example, we've seen instances where the journal is under heavy
reservation pressure, the journal allocation path transitions into
the final available journal bucket, the journal write path
immediately consumes that bucket and calls into
bch2_journal_space_available(), which then in turn flags the journal
as stuck because there is no available space and shuts down the
filesystem instead of submitting the journal write (that would have
otherwise succeeded).

To avoid this problem, simplify the journal stuck checking by just
relying on the higher level logic in the journal reservation path.
This produces more useful debug output and is a more reliable
indicator that things have bogged down.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 83ec519a 07-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: When shutting down, flush btree node writes last

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 637de729 11-Nov-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure btree node cache is not more than half dirty

Tweak journal reclaim to ensure the btree node cache isn't more
than half dirty so that memory reclaim can always make progress - the
same as we do for the btree key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# a2b9a5b2 06-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix bch2_journal_flush_device_pins()

It's now legal for the pin fifo to be empty, which means this code needs
to be updated in order to not hit an assert.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 3e3e02e6 19-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted checkpatch fixes

checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:

ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
  (hopefully with docbook documentation), not .h file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE - we have _many_ x-macros and other
  macros where we can't do this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 674cfc26 26-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add persistent counters for all tracepoints

Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d4bf5eec 18-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use bch2_err_str() in error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 1f93726e 17-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

Delete some obsolete tracepoints, organize alloc tracepoints better,
make a few tracepoints more consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 31f63fd1 14-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Introduce a separate journal watermark for copygc

Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.

This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 70a9953c 05-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_journal_pin_set()

When bch2_journal_pin_set() is updating an existing pin, we shouldn't
call bch2_journal_reclaim_fast() after dropping the old pin and before
dropping the new pin - that could reclaim the entry we're trying to pin.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7fda0f08 28-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Work around a journal self-deadlock

bch2_journal_space_available -> bch2_journal_halt() self deadlocks on
journal lock; work around this by dropping/retaking journal lock before
we call bch2_fatal_error().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 718ce1eb 06-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Skip periodic wakeup of journal reclaim when journal empty

Less system noise.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 30ef633a 28-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Refactor journal code to not use unwritten_idx

It makes the code more readable if we work off of sequence numbers,
instead of direct indexes into the array of journal buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# f0a3a2cc 28-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal seq now incremented at entry open, not close

This patch changes journal_entry_open() to initialize the new journal
entry, not __journal_entry_close().

This also means that journal_cur_seq() refers to the sequence number of
the last journal entry when we don't have an open journal entry, not the
next one.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 2975cd47 25-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't spin in journal reclaim

If we're not able to flush anything, we shouldn't keep looping.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
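
The loop condition described in 2975cd47 can be sketched like this. The callback type, the example callback and the loop function are all hypothetical; the only idea taken from the commit is that an iteration which flushes nothing ends the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flush callback: returns how many pins it managed to flush. */
typedef size_t (*flush_fn_sketch)(void *ctx);

/* Keep flushing until the target is met or a pass makes no progress. */
size_t reclaim_loop_sketch(flush_fn_sketch flush, void *ctx, size_t want)
{
	size_t total = 0;

	assert(flush);
	while (total < want) {
		size_t n = flush(ctx);

		if (!n)
			break; /* nothing was flushable: stop instead of spinning */
		total += n;
	}
	return total;
}

/* Example callback: flushes one pin per call until the budget in *ctx runs out. */
size_t flush_one_from_budget_sketch(void *ctx)
{
	size_t *budget = ctx;

	if (!*budget)
		return 0;
	(*budget)--;
	return 1;
}
```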


# cb598111 25-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix journal_flush_done()

journal_flush_done() was overwriting did_work, thus occasionally
returning false when it did do work and occasional assertions in the
shutdown sequence because we didn't completely flush the key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
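
The class of bug fixed in cb598111 is an accumulator that gets overwritten instead of OR-ed. A hypothetical illustration, with the per-step results passed in as an array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any step reported work; an assignment here would keep only the last step. */
bool any_work_done_sketch(const bool *step_did_work, size_t nr_steps)
{
	bool did_work = false;

	assert(step_did_work || !nr_steps);
	for (size_t i = 0; i < nr_steps; i++)
		did_work |= step_did_work[i]; /* accumulate, never overwrite */
	return did_work;
}
```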


# fa8e94fa 25-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Heap allocate printbufs

This patch changes printbufs to dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.

The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
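
A userspace sketch of the pattern fa8e94fa describes: a string buffer that grows on demand and therefore needs an explicit exit call. The struct and the pbuf_*_sketch() names are hypothetical and only mirror the idea; this is not the bcachefs printbuf API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable print buffer; zero-initialize before first use. */
struct pbuf_sketch {
	char   *buf;
	size_t  size; /* bytes allocated */
	size_t  pos;  /* bytes used, not counting the terminating NUL */
};

/* Append formatted text, growing the heap buffer as needed. Returns 0 or -1. */
int pbuf_printf_sketch(struct pbuf_sketch *p, const char *fmt, ...)
{
	va_list args;

	assert(p && fmt);

	/* First pass: measure how much room the formatted text needs. */
	va_start(args, fmt);
	int len = vsnprintf(NULL, 0, fmt, args);
	va_end(args);
	if (len < 0)
		return -1;

	size_t need = p->pos + (size_t)len + 1;
	if (need > p->size) {
		size_t new_size = p->size ? p->size : 32;

		while (new_size < need)
			new_size *= 2;
		char *n = realloc(p->buf, new_size);
		if (!n)
			return -1; /* old buffer is still valid and owned by p */
		p->buf = n;
		p->size = new_size;
	}

	/* Second pass: format into the (possibly reallocated) buffer. */
	va_start(args, fmt);
	vsnprintf(p->buf + p->pos, p->size - p->pos, fmt, args);
	va_end(args);
	p->pos += (size_t)len;
	return 0;
}

/* Release the heap allocation; skipping this call leaks the buffer. */
void pbuf_exit_sketch(struct pbuf_sketch *p)
{
	free(p->buf);
	p->buf = NULL;
	p->size = p->pos = 0;
}
```

The trade-off is the one named in the commit message: stack usage drops, but every user now has a cleanup obligation and an allocation that can fail.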


# b66b2bc0 23-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Revert "Ensure journal doesn't get stuck in nochanges mode"

This patch was originally to work around the journal getting stuck in
nochanges mode - but that was just a hack, we needed to fix the actual
bug. It should be fixed now, so revert it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 3117db99 21-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't issue discards when in nochanges mode

When the nochanges option is selected, we're supposed to never issue
writes. Unfortunately, it seems discards were missed when implementing
this, leading to some painful filesystem corruption.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# d8601afc 27-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Simplify journal replay

With BTREE_ITER_WITH_JOURNAL, there are no longer any restrictions on the
order we have to replay keys from the journal in, and we can also start
up journal reclaim right away - and delete a bunch of code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 2430e72f 04-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert journal sysfs params to regular options

This converts journal_write_delay, journal_flush_disabled, and
journal_reclaim_delay to normal filesystem options, and also adds them
to the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# fae1157d 28-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure journal doesn't get stuck in nochanges mode

This tweaks the journal code to always act as if there's space available
in nochanges mode, when we're not going to be doing any writes. This
helps in recovering filesystems that won't mount because they need
journal replay and the journal has gotten stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 6a0f414e 16-Oct-2021 Brett Holman <bholman.devel@gmail.com>

bcachefs: Fix compiler warnings

Type size_t is architecture-specific. Fix warnings for some non-amd64
arches.

Signed-off-by: Brett Holman <bholman.devel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d7fc453b 30-May-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal space calculation fix

When devices have different bucket sizes, we may accumulate a journal
write that doesn't fit on some of our devices - previously, we'd
underflow when calculating space on that device and then everything
would get weird.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
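
The underflow described in d7fc453b is the usual unsigned-subtraction hazard. The sketch below is a hypothetical illustration of guarding against it, assuming each device contributes one bucket; it is not the real space calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the space left on each device after a pending journal write.
 * A device whose bucket is smaller than the write contributes nothing,
 * rather than wrapping around to a huge unsigned value.
 */
uint64_t journal_space_sketch(const uint64_t *bucket_sectors, size_t nr_devs,
			      uint64_t pending_sectors)
{
	uint64_t total = 0;

	assert(bucket_sectors || !nr_devs);
	for (size_t i = 0; i < nr_devs; i++) {
		if (pending_sectors > bucket_sectors[i])
			continue; /* write does not fit on this device */
		total += bucket_sectors[i] - pending_sectors;
	}
	return total;
}
```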


# 2ce867df 28-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make sure to initialize j->last_flushed

If the journal reclaim thread makes it to the timeout without ever
initializing j->last_flushed, we could end up sleeping for a very long
time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f09517fc 20-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock on journal reclaim

Flushing the btree key cache needs to use allocation reserves - journal
reclaim depends on flushing the btree key cache for making forward
progress, and the allocator and copygc depend on journal reclaim making
forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96f399d0 15-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix journal reclaim loop

When dirty key cache keys were separated from other journal pins, we
broke the loop conditional in __bch2_journal_reclaim() - it's supposed
to keep looping as long as there's work to do.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 241e2636 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't flush btree writes more aggressively because of btree key cache

We need to flush the btree key cache when it's too dirty, because
otherwise the shrinker won't be able to reclaim memory - this is done by
journal reclaim. But journal reclaim also kicks btree node writes: this
meant that btree node writes were getting kicked much too often just
because we needed to flush btree key cache keys.

This patch splits journal pins into two different lists, and teaches
journal reclaim to not flush btree node writes when it only needs to
flush key cache keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2940295c 03-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Be more careful about JOURNAL_RES_GET_RESERVED

JOURNAL_RES_GET_RESERVED should only be used for updates that need to be
done to free up space in the journal. In particular, when we're flushing
keys from the key cache, if we're flushing them out of order we
shouldn't be using it, since we're using up our remaining space in the
journal without dropping a pin that will let us make forward progress.

With this patch, BTREE_INSERT_JOURNAL_RECLAIM without
BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on
journal reclaim if we're already in journal reclaim.

This means we need to propagate these errors up to journal reclaim,
indicating that flushing a journal pin should be retried in the future.

This is prep work for a patch to change the way journal reclaim works,
to split out flushing key cache keys because the btree key cache is too
dirty from journal reclaim because we need space in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 24db24c7 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't make foreground writes wait behind journal reclaim too long

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c5f51cdd 28-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Have journal reclaim thread flush more aggressively

This adds a new watermark for the journal reclaim when flushing btree
key cache entries - it should try and stay ahead of where foreground
threads doing transaction commits will enter direct journal reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 331194a2 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache locking improvements

The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# dab9ef0d 23-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add error message for some allocation failures

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d483dd17 16-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix race between journal_seq_copy() and journal_seq_drop()

In bch2_btree_interior_update_will_free_node, we copy the journal pins
from outstanding writes on the btree node we're about to free. But, this
can race with the writes completing, and dropping their journal pins.

To guard against this, just use READ_ONCE() in bch2_journal_pin_copy().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b18df768 06-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Prevent journal reclaim from spinning

Without checking if we actually flushed anything, journal reclaim could
still go into an infinite loop while trying to shut down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f51e84fe 05-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix btree key cache dirty checks

Had a typo that meant we were triggering journal reclaim _much_ more
aggressively than needed. Also, fix a potential integer overflow.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5d32c5bb 05-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Be more conservative about journal pre-reservations

- Try to always keep 1/8th of the journal free, on top of
  pre-reservations
- Move the check for whether the journal is stuck to
  bch2_journal_space_available, and make it only fire when there aren't
  any journal writes in flight (that might free up space by updating
  last_seq)

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# adbcada4 14-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't require flush/fua on every journal write

This patch adds a flag to journal entries which, if set, indicates that
they weren't done as flush/fua writes.

- non flush/fua journal writes don't update last_seq (i.e. they don't
  free up space in the journal), thus the journal free space
  calculations now check whether nonflush journal writes are currently
  allowed (i.e. are we low on free space, or would doing a flush write
  free up a lot of space in the journal)

- write_delay_ms, the user configurable option for when open journal
  entries are automatically written, is now interpreted as the max
  delay between flush journal writes (default 1 second).

- bch2_journal_flush_seq_async is changed to ensure a flush write >=
  the requested sequence number has happened

- journal read/replay must now ignore, and blacklist, any journal
  entries newer than the most recent flush entry in the journal. Also,
  the way the read_entire_journal option is handled has been improved;
  struct journal_replay now has an entry, 'ignore', for entries that
  were read but should not be used.

- assorted refactoring and improvements related to journal read in
  journal_io.c and recovery.c

Previously, we'd have to issue a flush/fua write every time we
accumulated a full journal entry - typically the bucket size. Now we
need to issue them much less frequently: when an fsync is requested, or
it's been more than write_delay_ms since the last flush, or when we need
to free up space in the journal. This is a significant performance
improvement on many write heavy workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
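
The three conditions that adbcada4 lists for issuing a flush/fua write can be sketched as a single predicate. The function name and parameters are hypothetical; the times are assumed to be monotonic milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the next journal write should be a flush/fua write:
 * an fsync was requested, the last flush is older than write_delay_ms,
 * or space in the journal has to be freed.
 */
bool journal_write_needs_flush_sketch(bool fsync_requested,
				      uint64_t now_ms, uint64_t last_flush_ms,
				      uint64_t write_delay_ms,
				      bool need_journal_space)
{
	assert(now_ms >= last_flush_ms);
	return fsync_requested ||
	       now_ms - last_flush_ms > write_delay_ms ||
	       need_journal_space;
}
```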


# b6df4325 13-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve journal free space calculations

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ebb84d09 13-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Increase journal pipelining

This patch increases the maximum journal buffers in flight from 2 to 4 -
this will be particularly helpful when in the future we stop requiring
flush+fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# afa7cb0c 03-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Check for errors in bch2_journal_reclaim()

If the journal is halted, journal reclaim won't necessarily be able to
make any forward progress, and won't accomplish anything anyways - we
should bail out so that we don't get stuck looping in reclaim when the
caches are too dirty and we should be shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 231db03c 01-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal pin refactoring

This deletes some duplicated code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5731cf01 29-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix journal reclaim spinning in recovery

We can't run journal reclaim until we've finished replaying updates to
interior btree nodes - the check for this was in the wrong place though,
leading to journal reclaim spinning before it was allowed to proceed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b7a9bbfc 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Move journal reclaim to a kthread

This is to make tracing easier.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9d4582ff 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal reclaim requires memalloc_noreclaim_save()

Memory reclaim requires journal reclaim to make forward progress - it's
what cleans our caches - thus, while we're in journal reclaim or holding
the journal reclaim lock we can't recurse into memory reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8a92e545 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure journal reclaim runs when btree key cache is too dirty

Ensuring the key cache isn't too dirty is critical for ensuring that the
shrinker can reclaim memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ed0e24c0 18-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Be more precise with journal error reporting

We were incorrectly detecting a journal deadlock - the journal filling
up - when only the journal pin fifo had filled up; if the journal pin
fifo is full that just means we need to wait on reclaim.

This plumbs through better error reporting so we can better discriminate
in the journal_res_get path what's going on.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f526d26d 11-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix btree key cache shutdown

On emergency shutdown, we might still have dirty keys in the btree key
cache that need to be cleaned up properly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6a747c46 09-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add accounting for dirty btree nodes/keys

This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4 of the btree node cache and btree key cache are dirty,
ensuring the shrinkers can free memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
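
The 3/4 threshold can be illustrated with a small check. This is a sketch only: `struct cache_stats` and both function names are hypothetical, and the real accounting in the btree node cache and key cache may be organized differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cache counters. */
struct cache_stats {
	size_t nr_dirty;	/* dirty entries */
	size_t nr_total;	/* all entries */
};

/* True when more than 3/4 of the cache is dirty. */
static bool cache_too_dirty(const struct cache_stats *s)
{
	assert(s->nr_dirty <= s->nr_total);

	/* Cross-multiplied form of nr_dirty / nr_total > 3/4; no division needed. */
	return s->nr_dirty * 4 > s->nr_total * 3;
}

/* Reclaim has work to do if either cache is over the threshold. */
static bool reclaim_should_run(const struct cache_stats *node_cache,
			       const struct cache_stats *key_cache)
{
	return cache_too_dirty(node_cache) || cache_too_dirty(key_cache);
}
```

Comparing `nr_dirty * 4` against `nr_total * 3` also handles an empty cache without a special case, since 0 > 0 is false.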


# 2f33ece9 02-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Minor journal reclaim improvement

With the btree key cache code, journal reclaim now has a lot more work
to do. It could be the case that after journal reclaim has finished one
iteration there's already more work to do, so put it in a loop to check
for that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
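
The loop described above might look roughly like this, assuming a toy model in which new work can arrive while a pass is running. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: pending work that can grow while reclaim runs. */
struct reclaim_state {
	size_t pending;		/* items waiting to be flushed */
	size_t arrivals;	/* items that show up during the next pass */
};

/* One pass: flush what was pending at entry and return that count. */
static size_t reclaim_once(struct reclaim_state *s)
{
	size_t flushed = s->pending;

	/* Work that arrived during this pass is left for the next one. */
	s->pending = s->arrivals;
	s->arrivals = 0;
	return flushed;
}

/* Repeat passes until one finishes with nothing left; return the pass count. */
static size_t reclaim_loop(struct reclaim_state *s)
{
	size_t passes = 0;

	do {
		reclaim_once(s);
		passes++;
	} while (s->pending);

	assert(s->pending == 0);
	return passes;
}
```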


# 89fd25be 09-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use x-macros for data types

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5d20ba48 04-Oct-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use cached iterators for alloc btree

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2ca88e5a 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is currently up to the users to handle, which isn't
ideal but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
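
The write-cache behaviour described above can be sketched with a toy model: an update is journalled and lands in the cache, and the btree only sees it when reclaim flushes the cached entry. The types and function names are hypothetical, and the arrays stand in for the hash table and the btree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_KEYS 4

/* Hypothetical toy model: keys are indices 0..NR_KEYS-1, values are ints. */
struct toy_fs {
	int	btree[NR_KEYS];		/* values as stored in the btree */
	int	cache[NR_KEYS];		/* values held in the key cache */
	bool	dirty[NR_KEYS];		/* cached value is newer than the btree */
	size_t	journal_entries;	/* number of journalled updates */
};

/* Update path: journal the update and dirty the cached entry; btree untouched. */
static void cached_update(struct toy_fs *fs, size_t key, int val)
{
	assert(key < NR_KEYS);

	fs->journal_entries++;
	fs->cache[key] = val;
	fs->dirty[key] = true;
}

/* Reclaim path: write dirty cached entries to the btree; return how many. */
static size_t flush_key_cache(struct toy_fs *fs)
{
	size_t flushed = 0;

	for (size_t i = 0; i < NR_KEYS; i++) {
		if (!fs->dirty[i])
			continue;
		fs->btree[i] = fs->cache[i];
		fs->dirty[i] = false;
		flushed++;
	}
	return flushed;
}
```

Two updates to the same key before a flush cost two journal entries but only one btree write, which is the benefit of deferring the btree update.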


# a27443bc 03-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill old allocator startup code

It's not needed anymore since we can now write to buckets before
updating the alloc btree.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 039fc4c5 28-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fixes for going RO

Now that interior btree updates are fully transactional, we don't need
to write out alloc info in a loop. However, interior btree updates do
put more things in the journal, so we still need a loop in the RO
sequence.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 00b8ccf7 25-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Interior btree updates are now fully transactional

We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 94035eed 10-Apr-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a locking bug in bch2_journal_pin_copy()

There was a race where the src pin would be flushed - releasing the last
pin on that sequence number - before adding the new journal pin. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
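
The hazard can be illustrated with a reference-count model. This is a single-threaded sketch under stated assumptions, not the fix itself: the commit message does not describe the locking change, and `struct seq_pins`, `pin_add()` and `pin_drop()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the pins held on one journal sequence number. */
struct seq_pins {
	unsigned count;		/* outstanding pins */
	bool	 reclaimed;	/* set once the last pin has been released */
};

/* Take a pin; fails if the sequence number has already been reclaimed. */
static bool pin_add(struct seq_pins *s)
{
	if (s->reclaimed)
		return false;
	s->count++;
	return true;
}

/* Release a pin; releasing the last one lets the sequence be reclaimed. */
static void pin_drop(struct seq_pins *s)
{
	assert(s->count > 0);

	if (--s->count == 0)
		s->reclaimed = true;
}
```

If the source pin is dropped first, the count reaches zero and a later `pin_add()` is too late. Taking the new pin while the source pin is still held keeps the count above zero throughout.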


# 3f58a197 27-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal pin cleanups

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b5d05635 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: minor journal reclaim fixes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 68ef94a6 19-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a pre-reserve mechanism for the journal

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9ace606e 28-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't block on reclaim_lock from journal_res_get

When we're doing btree updates from journal flush, this becomes a
locking inversion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 03d5eaed 03-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_journal_space_available improvements

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2384db8f 03-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Separate discards from rest of journal reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0ce2dbbe 03-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: ja->discard_idx, ja->dirty_idx

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# dc9aa178 01-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Drop a faulty assertion

The assertion was meant to check that bch2_journal_reclaim_fast() was
always being called, but since the atomic dec can happen outside of
j->lock, the assertion itself can race.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e5a66496 21-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal reclaim refactoring

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7ef2a73a 21-Jan-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix check for if extent update is allocating

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0519b72d 18-Jan-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a workqueue for journal reclaim

Journal reclaim writes btree nodes, which can end up waiting for
in-flight btree writes to complete, and btree write completions run out
of workqueues - so we can't run out of the same workqueue or we risk
deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 000de459 13-Jan-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: fixes for getting stuck flushing journal pins

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3636ed48 17-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Deferred btree updates

Will be used in the future for inode updates, which will be very helpful
for multithreaded workloads that have to update the inode with every
extent update (appends, or updates that change i_sectors).

Will also be used eventually for fully persistent alloc info.

However, we still need a mechanism for reserving space in the journal
prior to getting a journal reservation, so it's not technically safe to
make use of this just yet: we could deadlock with the journal full
(although that's not likely to be an issue in practice).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a9ec3454 18-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal refactoring

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4077991c 16-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a use after free in the journal code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1c6fdbd8 17-Mar-2017 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Initial commit

Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.

Website: https://bcachefs.org

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>