History log of /linux-master/fs/bcachefs/btree_write_buffer.c
Revision Date Author Comments
# 86dbf8c5 26-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree node merging on write buffer btrees

The btree write buffer flush fastpath that avoids the main transaction
commit path had the unfortunate side effect of not doing btree node
merging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8aad8e1f 22-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix journal pins in btree write buffer

btree write buffer flush has two phases:
- in natural key order, which is more efficient but may fail
- then in journal order

The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.
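
As a rough illustration of the extra sort (not the actual patch), the sketch
below assumes a simplified key with only a journal seq and a position, and
assumes that a seq of 0 marks a key the first phase already handled. The
names struct wb_key, wb_key_seq_cmp() and wb_prepare_journal_order() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffered key: only the fields the sketch needs. */
struct wb_key {
	uint64_t journal_seq;	/* assumption: 0 means "already flushed or coalesced" */
	uint64_t pos;		/* simplified btree position */
};

/* Order by journal sequence number; ties broken by position for determinism. */
static int wb_key_seq_cmp(const void *_l, const void *_r)
{
	const struct wb_key *l = _l, *r = _r;

	if (l->journal_seq != r->journal_seq)
		return l->journal_seq < r->journal_seq ? -1 : 1;
	if (l->pos != r->pos)
		return l->pos < r->pos ? -1 : 1;
	return 0;
}

/*
 * Drop keys the first phase already handled (seq == 0), then re-sort what
 * is left by journal seq. Returns the number of keys still to flush.
 */
static size_t wb_prepare_journal_order(struct wb_key *keys, size_t nr)
{
	size_t dst = 0;

	assert(keys || !nr);

	for (size_t i = 0; i < nr; i++)
		if (keys[i].journal_seq)
			keys[dst++] = keys[i];

	qsort(keys, dst, sizeof(keys[0]), wb_key_seq_cmp);
	return dst;
}
```

After the first phase the array is in key order, so the surviving entries
can carry seqs in any order; only after this sort is it safe to walk them
while advancing a journal pin.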

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3ed94062 17-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bch2_fatal_error()

error messages should always include __func__

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0b5961b0 23-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: jset_entry for loops declare loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d9290c99 17-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix journal_buf bitfield accesses

All journal_buf bitfield updates must happen under the journal lock -
perhaps we should just switch these to atomic bit flags.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ec4edd7b 16-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Prep work for variable size btree node buffers

bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.

And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.

Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8feaebb0 27-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: __bch2_journal_key_to_wb -> bch2_journal_key_to_wb_slowpath

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 37165014 27-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: wb_key_cmp -> wb_key_ref_cmp

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 07f383c7 03-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_iter -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f6363aca 08-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_path_make_mut() -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# defd9e39 16-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: darray_for_each() now declares loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 38ced43b 03-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Inline btree write buffer sort

The sort in the btree write buffer flush path is a very hot path, and
it's particularly performance sensitive since it's single threaded and
can block every other thread on a multithreaded write workload.

It's well worth doing a sort with inlined cmp and swap functions.
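
To show what "inlined cmp and swap" buys over a generic sort taking function
pointers, here is a minimal heapsort sketch. It is not the sort from this
patch; struct wb_ref and the wb_ref_* / wb_sort() names are hypothetical, and
the two-word key layout is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact sort key: comparing two of these touches 16 bytes. */
struct wb_ref {
	uint64_t hi;	/* e.g. btree id and high bits of the position */
	uint64_t lo;	/* e.g. low bits of the position and a buffer index */
};

static inline int wb_ref_cmp(const struct wb_ref *l, const struct wb_ref *r)
{
	if (l->hi != r->hi)
		return l->hi < r->hi ? -1 : 1;
	if (l->lo != r->lo)
		return l->lo < r->lo ? -1 : 1;
	return 0;
}

static inline void wb_ref_swap(struct wb_ref *l, struct wb_ref *r)
{
	struct wb_ref tmp = *l;

	*l = *r;
	*r = tmp;
}

/* Restore the max-heap property for the subtree rooted at i. */
static inline void wb_sift_down(struct wb_ref *v, size_t i, size_t n)
{
	for (;;) {
		size_t c = 2 * i + 1;

		if (c >= n)
			break;
		if (c + 1 < n && wb_ref_cmp(&v[c], &v[c + 1]) < 0)
			c++;
		if (wb_ref_cmp(&v[i], &v[c]) >= 0)
			break;
		wb_ref_swap(&v[i], &v[c]);
		i = c;
	}
}

/* In-place heapsort, ascending; cmp and swap are direct calls the compiler can inline. */
static void wb_sort(struct wb_ref *v, size_t n)
{
	for (size_t i = n / 2; i-- > 0;)
		wb_sift_down(v, i, n);

	for (size_t end = n; end-- > 1;) {
		wb_ref_swap(&v[0], &v[end]);
		wb_sift_down(v, 0, end);
	}
}
```

With a generic sort, every comparison and swap goes through an indirect
call; here the compiler sees the concrete type and can reduce both to a few
loads, compares and stores.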

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09caeabe 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree write buffer now slurps keys from journal

Previously, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for; if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.
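
The watermark-based throttling can be pictured roughly as below. This is a
sketch only: the enum, the function names and the 3/4 "close to full"
threshold are assumptions made for the example, not values taken from this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of watermarks; higher values are more privileged. */
enum wb_watermark {
	WB_WATERMARK_normal,
	WB_WATERMARK_reclaim,
};

/* Assumed threshold for illustration: "close to full" means over 3/4 used. */
static bool wb_should_throttle(size_t nr_keys, size_t capacity)
{
	return nr_keys > capacity / 4 * 3;
}

/* Watermark the journal should enforce, given the write buffer fill level. */
static enum wb_watermark wb_journal_watermark(size_t nr_keys, size_t capacity)
{
	return wb_should_throttle(nr_keys, capacity)
		? WB_WATERMARK_reclaim
		: WB_WATERMARK_normal;
}

/* A reservation is granted only if it is at least as privileged as the watermark. */
static bool wb_journal_res_allowed(enum wb_watermark requested,
				   enum wb_watermark current)
{
	return requested >= current;
}
```

Ordinary commits then block in the journal reservation path rather than
checking the write buffer themselves, while journal reclaim (which is what
flushes the buffer) can still make progress.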

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8a4b4c52 26-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: more write buffer refactoring

prep work for big rewrite - no functional changes in this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ab4fb4b6 26-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: wb_flush_one_slowpath()

A bit of refactoring for better inlining in the main btree write buffer
flush path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cb13f471 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush()

More accurate naming.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d3083cf2 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_write_buffer_flush_locked()

Minor refactoring - improved naming, and move the responsibility for
flush_lock to the caller instead of having it be shared.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 183bcc89 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clean up btree write buffer write ref handling

__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.

This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of in the previous code passing flags
around and hoping the NOCHECK_RW flag was always carried around
correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 56db2429 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve btree write buffer tracepoints

- add a tracepoint for write_buffer_flush_sync; this is expensive
- fix the write_buffer_flush_slowpath tracepoint

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cb52d23e 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Rename BTREE_INSERT flags

BTREE_INSERT flags are actually transaction commit flags - rename them
for clarity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e17b93eb 07-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Avoiding dropping/retaking write locks in bch2_btree_write_buffer_flush_one()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 43c7ede0 08-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_UPDATE_PREJOURNAL

With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
now switch the btree write buffer to use it for flushing.

This has the advantage that transaction commits don't need to take a
journal reservation at all.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3eedfe1a 09-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Journal pins must always have a flush_fn

flush_fn is how we identify journal pins in debugfs - this is a
debugging aid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# da525760 21-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree write buffer with snapshots btrees

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 60a5b898 19-Jul-2023 Brian Foster <bfoster@redhat.com>

bcachefs: use prejournaled key updates for write buffer flushes

The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has been shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of the key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
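
The ordering problem can be reproduced with a toy replay model. The sketch
below is not bcachefs code: struct jrec and replay_key_exists() are
hypothetical, and it assumes a single key whose journal records are either a
whole-key insert or a delete, replayed in seq order with the last record
winning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical journal record for one key: an insert or a delete at a given seq. */
struct jrec {
	uint64_t seq;
	bool	 deleted;
};

/*
 * Replay records in journal order (input must be sorted by seq) and
 * report whether the key exists afterwards: the last record wins.
 */
static bool replay_key_exists(const struct jrec *log, size_t nr)
{
	bool exists = false;

	for (size_t i = 0; i < nr; i++) {
		assert(!i || log[i - 1].seq <= log[i].seq);
		exists = !log[i].deleted;
	}
	return exists;
}
```

With an insert at seq 10 and a delete at seq 11, replay correctly ends with
the key gone. If the flush of the older buffer rejournals the insert at seq
12, replay of { 10 insert, 11 delete, 12 insert } ends with the key present,
which is the stale result described above.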

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d82978ca 12-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add a race_fault() for write buffer slowpath

We haven't hooked up dynamic fault injection quite yet, but we will soon

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f33c58fc 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_INSERT_USE_RESERVE

Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ec14fc60 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill JOURNAL_WATERMARK

This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8e5b1115 11-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW

btree write buffer flush is only invoked from contexts that already hold
a write ref, and checking if we're still RW could cause us to fail to
completely flush the write buffer when shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 873555f0 17-Mar-2023 Brian Foster <bfoster@redhat.com>

bcachefs: more aggressive fast path write buffer key flushing

The btree write buffer flush code is prone to causing journal
deadlock due to inefficient use and release of reservation space.
Reservation is not pre-reserved for write buffered keys (as is done
for key cache keys, for example), because the write buffer flush
side uses a fast path that attempts insertion without need for any
reservation at all.

The write buffer flush attempts to deal with this by inserting keys
using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on
journal reservations that require blocking. Upon first error, it
falls back to a slow path that inserts in journal order and supports
moving the associated journal pin forward.

The problem is that under pathological conditions (i.e. smaller log,
larger write buffer and journal reservation pressure), we've seen
instances where the fast path fails fairly quickly without having
completed many insertions, and then the slow path is unable to push
the journal pin forward enough to free up the space it needs to
completely flush the buffer. This problem is occasionally reproduced
by fstest generic/333.

To avoid this problem, update the fast path algorithm to skip key
inserts that fail due to inability to acquire needed journal
reservation without immediately breaking out of the loop. Instead,
insert as many keys as possible, zap the sequence numbers to mark
them as processed, and then fall back to the slow path to process
the remaining set in journal order. This reduces the amount of
journal reservation that might be required to flush the entire
buffer and increases the odds that the slow path is able to move the
journal pin forward and free up space as keys are processed.
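
The shape of the updated fast path can be sketched as follows. This is an
illustration under assumptions, not the patch itself: the key layout, the
wb_try_insert_fn callback, wb_flush_fastpath() and the demo callback are all
hypothetical, and a zero journal_seq is assumed to mean "processed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct wb_key {
	uint64_t journal_seq;	/* zapped to 0 once the key has been inserted */
	uint64_t pos;
};

/* Hypothetical callback: returns false if the insert would need to block. */
typedef bool (*wb_try_insert_fn)(const struct wb_key *k, void *ctx);

static int wb_seq_cmp(const void *_l, const void *_r)
{
	const struct wb_key *l = _l, *r = _r;

	return (l->journal_seq > r->journal_seq) - (l->journal_seq < r->journal_seq);
}

/*
 * Fast path: walk keys (assumed already sorted by pos), insert what we can,
 * and keep going past failures instead of bailing at the first one.
 * Returns how many keys are left for the slow path, compacted to the front
 * of the array and sorted by journal seq.
 */
static size_t wb_flush_fastpath(struct wb_key *keys, size_t nr,
				wb_try_insert_fn try_insert, void *ctx)
{
	size_t left = 0;

	for (size_t i = 0; i < nr; i++)
		if (try_insert(&keys[i], ctx))
			keys[i].journal_seq = 0;	/* mark as processed */

	for (size_t i = 0; i < nr; i++)
		if (keys[i].journal_seq)
			keys[left++] = keys[i];

	qsort(keys, left, sizeof(keys[0]), wb_seq_cmp);
	return left;
}

/* Demo policy for the sketch only: pretend odd positions would block. */
static bool wb_demo_try_insert(const struct wb_key *k, void *ctx)
{
	size_t *nr_inserted = ctx;

	if (k->pos & 1)
		return false;
	(*nr_inserted)++;
	return true;
}
```

Because every key that can be inserted without blocking is removed first,
the slow path starts with the smallest possible remainder and needs less
journal reservation to finish the flush.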

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 65d48e35 14-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Private error codes: ENOMEM

This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 747ded6d 26-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix for shared paths in write buffer flush

It's possible for bch2_write_buffer_flush_one() to end up with a shared
path, if called from a context that already has a btree iterator
pointing to a key being flushed. We have to be careful when that
happens, since we can't clone a path that holds write locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8ffa11a2 19-Jan-2023 Daniel Hill <daniel@gluo.nz>

bcachefs: let __bch2_btree_insert() pass in flags

This patch is prep work for the following patch.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 920e69bc 03-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree write buffer

This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.

This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.

This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.

Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.

- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path.

- Transaction commit path changes:
The update to the btree write buffer both mutates global state and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction all at once, and also if it fails we have
to revert filesystem usage counter changes.

If there isn't room we flush the write buffer in the transaction
commit error path and retry.

- A new persistent option, for specifying the number of entries in the
write buffer.
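
The "flat fixed size array" and the flush-and-retry behaviour in the commit
path can be sketched as below. This is a simplified model, not the code
added by this patch: all names are hypothetical, -ENOSPC stands in for
whatever error the real code returns when the buffer is full, and the flush
merely counts keys instead of inserting them into a btree.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define WB_CAPACITY 4	/* tiny on purpose; the patch makes the size an option */

struct wb_key {
	uint64_t journal_seq;
	uint64_t pos;
	uint64_t val;
};

struct write_buffer {
	struct wb_key keys[WB_CAPACITY];
	size_t	      nr;
	size_t	      nr_flushed;	/* stand-in for "applied to the btree" */
};

/* Append without reading the btree; fails if the flat array is full. */
static int wb_insert(struct write_buffer *wb, struct wb_key k)
{
	if (wb->nr == WB_CAPACITY)
		return -ENOSPC;
	wb->keys[wb->nr++] = k;
	return 0;
}

/* Stand-in flush: a real one would sort the keys and insert them into the btree. */
static void wb_flush(struct write_buffer *wb)
{
	wb->nr_flushed += wb->nr;
	wb->nr = 0;
}

/* Commit-path shape: try, and on a full buffer flush and retry once. */
static int wb_commit(struct write_buffer *wb, struct wb_key k)
{
	int ret = wb_insert(wb, k);

	if (ret == -ENOSPC) {
		wb_flush(wb);
		ret = wb_insert(wb, k);
	}
	return ret;
}
```

Appending is cheap because it never touches the btree, which is why this
only suits btrees that are updated far more often than they are read.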

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>