History log of /linux-master/fs/bcachefs/btree_trans_commit.c
Revision Date Author Comments
# 82cf18f2 12-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix deadlock in journal replay

btree_key_can_insert_cached() should be checking the watermark -
BCH_TRANS_COMMIT_journal_replay really means nonblocking mode when
watermark < reclaim; it was being used incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 58caa786 11-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix UAFs of btree_insert_entry array

The btree paths array is now dynamically resizable, and so is the
btree_insert_entries array, as it needs to be the same size.

The merge path (and interior update path) allocates new btree paths and
can thus trigger a resize, so we must not retain direct pointers after
invoking merge, or after running btree node triggers.
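
As an illustration of the hazard described above, here is a minimal
userspace sketch. It is not the bcachefs code: struct entry_array,
entry_idx_t, entry_array_push() and entry_get() are hypothetical
stand-ins. The point is only that a growable array should be referenced
by index, with the pointer re-derived after any call that may resize it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the bcachefs structures. */
struct entry {
	int		key;
};

struct entry_array {
	struct entry	*data;
	size_t		nr;
	size_t		size;
};

typedef size_t entry_idx_t;

/*
 * Append an entry, growing the array if needed. realloc() may move the
 * whole array, so any 'struct entry *' obtained before this call must
 * be treated as stale afterwards.
 */
static int entry_array_push(struct entry_array *a, int key, entry_idx_t *idx)
{
	if (a->nr == a->size) {
		size_t new_size = a->size ? a->size * 2 : 4;
		struct entry *n = realloc(a->data, new_size * sizeof(*n));

		if (!n)
			return -1;
		a->data = n;
		a->size = new_size;
	}

	a->data[a->nr].key = key;
	*idx = a->nr++;
	return 0;
}

/* Re-derive the pointer from the index every time it is needed. */
static struct entry *entry_get(struct entry_array *a, entry_idx_t idx)
{
	assert(idx < a->nr);
	return &a->data[idx];
}
```

A caller keeps the entry_idx_t across entry_array_push() calls and only
calls entry_get() at the point of use; holding the returned pointer
across a push is the use-after-free pattern.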

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e2a316b3 01-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BCH_WATERMARK_interior_updates

This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.

One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ec9cc18f 22-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add checks for invalid snapshot IDs

Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.

This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7be0208f 17-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: add missing __GFP_NOWARN

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ec4edd7b 16-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Prep work for variable size btree node buffers

bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.

And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.

Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5b14ce35 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_trans_account_disk_usage_change()

The disk space accounting rewrite is splitting out accounting for each
replicas set - those are moving to btree keys, instead of percpu
counters.

This breaks bch2_trans_fs_usage_apply() up, splitting out the part we
will still need.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 38c23fb8 07-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_TRIGGER_ATOMIC

Add a new flag to be explicit about when we're running atomic triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f0431c5f 31-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Combine .trans_trigger, .atomic_trigger

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ad00bce0 27-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: mark now takes bkey_s

Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 717296c3 27-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans_mark now takes bkey_s

Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5e329145 20-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Check journal entries for invalid keys in trans commit path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ff70ad2c 15-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix interior update path btree_path uses

Since the btree_paths array is now about to become growable, we have to
be careful not to refer to paths by pointer across contexts where they
may be reallocated.

This fixes the remaining btree_interior_update() paths - split and
merge.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6474b706 11-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clean up btree_trans

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7f9821a7 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_insert_entry -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 559e6c23 16-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans_for_each_update() now declares loop iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 67997234 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: kill btree_trans->wb_updates

The btree write buffer path now creates a journal entry directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09caeabe 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree write buffer now slurps keys from journal

Previously, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for; if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 24de63da 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve trans->extra_journal_entries

Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.
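
For readers unfamiliar with the pattern, a bump allocator with an
inlined fastpath and a remembered size hint can be sketched as below.
This is a userspace illustration under stated assumptions, not the
bcachefs allocator: struct bump, bump_init(), bump_alloc() and
bump_alloc_slowpath() are hypothetical names, and the hint stands in for
the size remembered between transactions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bump allocator state; not a bcachefs structure. */
struct bump {
	char	*mem;
	size_t	used;
	size_t	size;
};

/* Preallocate 'hint' bytes, e.g. the peak usage seen on a previous run. */
static int bump_init(struct bump *b, size_t hint)
{
	memset(b, 0, sizeof(*b));
	if (hint) {
		b->mem = malloc(hint);
		if (!b->mem)
			return -1;
		b->size = hint;
	}
	return 0;
}

/*
 * Slow path: grow the backing buffer. realloc() may move it, which
 * invalidates pointers handed out earlier; a good initial hint keeps
 * callers off this path.
 */
static void *bump_alloc_slowpath(struct bump *b, size_t bytes)
{
	size_t new_size = b->size ? b->size : 64;
	char *n;
	void *ret;

	while (new_size < b->used + bytes)
		new_size *= 2;

	n = realloc(b->mem, new_size);
	if (!n)
		return NULL;

	b->mem	= n;
	b->size	= new_size;

	ret = b->mem + b->used;
	b->used += bytes;
	return ret;
}

/* Fast path: a bounds check and a pointer bump, small enough to inline. */
static inline void *bump_alloc(struct bump *b, size_t bytes)
{
	assert(bytes > 0);
	bytes = (bytes + 7) & ~(size_t) 7;	/* keep 8 byte alignment */

	if (b->used + bytes <= b->size) {
		void *ret = b->mem + b->used;

		b->used += bytes;
		return ret;
	}

	return bump_alloc_slowpath(b, bytes);
}
```

Recording b->used when a run finishes and passing it as the hint to the
next bump_init() means the next run is served entirely by the fastpath.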

This is prep work for converting write buffer updates to use this
mechanism.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d3083cf2 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_write_buffer_flush_locked()

Minor refactoring - improved naming, and move the responsibility for
flush_lock to the caller instead of having it be shared.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 183bcc89 02-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clean up btree write buffer write ref handling

__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.

This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of, as in the previous code, passing
flags around and hoping the NOCHECK_RW flag was always carried around
correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3c471b65 26-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: convert bch_fs_flags to x-macro

Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
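
The x-macro technique named in the subject can be sketched as follows.
The flag names and helpers here (EXAMPLE_FS_FLAGS, example_fs_flag_strs,
example_flags_to_text) are hypothetical and do not reproduce the real
bch_fs_flags list; the sketch only shows how one list generates both the
enum and a matching string table, which is what makes printing the flags
possible.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag list; the names are illustrative only. */
#define EXAMPLE_FS_FLAGS()	\
	x(started)		\
	x(rw)			\
	x(error)

/* First expansion: one enum constant per flag. */
enum example_fs_flags {
#define x(n)	EXAMPLE_FS_##n,
	EXAMPLE_FS_FLAGS()
#undef x
	EXAMPLE_FS_NR
};

/* Second expansion: a string table that cannot drift out of sync. */
static const char * const example_fs_flag_strs[] = {
#define x(n)	#n,
	EXAMPLE_FS_FLAGS()
#undef x
	NULL
};

/* Write the names of all set bits into buf, separated by spaces. */
static void example_flags_to_text(char *buf, size_t len, unsigned long flags)
{
	size_t pos = 0;

	assert(len > 0);
	buf[0] = '\0';

	for (unsigned i = 0; i < EXAMPLE_FS_NR; i++) {
		int n;

		if (!(flags & (1UL << i)))
			continue;

		n = snprintf(buf + pos, len - pos, "%s%s",
			     pos ? " " : "", example_fs_flag_strs[i]);
		if (n < 0 || (size_t) n >= len - pos)
			break;	/* output truncated; stop here */
		pos += n;
	}
}
```

Adding a flag then means adding one x() line; the enum value and its
printable name follow automatically.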

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cb52d23e 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Rename BTREE_INSERT flags

BTREE_INSERT flags are actually transaction commit flags - rename them
for clarity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# aa62aabb 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill dead BTREE_INSERT flags

BTREE_INSERT_NOWAIT and BTREE_INSERT_GC_LOCK_HELD are no longer used,
and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 43c7ede0 08-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_UPDATE_PREJOURNAL

With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
now switch the btree write buffer to use it for flushing.

This has the advantage that transaction commits don't need to take a
journal reservation at all.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9a71de67 08-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res"

This slightly changes how trans->journal_res works, in preparation for
changing the btree write buffer flush path to use it.

Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal
reservation; trans->journal_res.seq already refers to the journal
sequence number to pin".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 389c92b3 07-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clear k->needs_whiteout earlier in commit path

The upcoming btree write buffer rework is going to use the journal
itself as the first stage of the write buffer; this is a cleanup to make
sure k->needs_whiteout is initialized before keys hit the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3eedfe1a 09-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Journal pins must always have a flush_fn

flush_fn is how we identify journal pins in debugfs - this is a
debugging aid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09e0153b 23-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix warning when building in userspace

bch_err() doesn't reference the fs arg in userspace

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 006ccc30 04-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill journal pre-reservations

This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.
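
The two thresholds described above can be sketched as a small policy
function. This is an illustration only: the names (example_watermark,
example_journal_state, example_journal_check) and the way fullness is
measured are assumptions, not the bcachefs implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the two thresholds come from the text above. */
enum example_watermark {
	EXAMPLE_WATERMARK_normal,
	EXAMPLE_WATERMARK_reclaim,
};

struct example_journal_state {
	bool			reclaim_aggressively;
	enum example_watermark	min_watermark;
};

/*
 * Decide journal behaviour from how full the journal is:
 *  - more than 1/2 full: run reclaim more aggressively
 *  - more than 3/4 full: only reclaim may take new reservations
 */
static struct example_journal_state
example_journal_check(uint32_t used, uint32_t capacity)
{
	struct example_journal_state s = {
		.reclaim_aggressively	= false,
		.min_watermark		= EXAMPLE_WATERMARK_normal,
	};

	assert(capacity > 0 && used <= capacity);

	/* Widen to 64 bits so the multiplications cannot overflow. */
	if ((uint64_t) used * 2 > capacity)
		s.reclaim_aggressively = true;

	if ((uint64_t) used * 4 > (uint64_t) capacity * 3)
		s.min_watermark = EXAMPLE_WATERMARK_reclaim;

	return s;
}

/* A reservation is allowed if its watermark is at least the minimum. */
static bool example_res_allowed(struct example_journal_state s,
				enum example_watermark w)
{
	return w >= s.min_watermark;
}
```

With a capacity of 100, a normal reservation is still allowed at 75 used
but refused at 76, while a reclaim reservation is allowed in both cases.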

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09b0283e 05-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Make sure to drop/retake btree locks before reclaim

We really don't want to be invoking memory reclaim with btree locks
held: even aside from (solvable, but tricky) recursion issues, it can
cause painful-to-diagnose performance edge cases.

This fixes a recently reported issue in btree_key_can_insert_cached().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Fixes: https://lore.kernel.org/linux-bcachefs/CAGudoHEsb_hGRMeWeXh+UF6po0qQuuq_NKSEo+s1sEb6bDLjpA@mail.gmail.com/T/


# 3b8c4507 06-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_trans->write_locked

As prep work for the next patch to fix a key cache reclaim issue, we
need to start tracking whether we're currently holding write locks - so
that we can release and retake them before calling into memory reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d3c7727b 03-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: rebalance_work btree is not a snapshots btree

rebalance_work entries may refer to entries in the extents btree, which
is a snapshots btree, or they may also refer to entries in the reflink
btree, which is not.

Hence rebalance_work keys may use the snapshot field but it's not
required to be nonzero - add a new btree flag to reflect this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6dfa10ab 31-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix build errors with gcc 10

gcc 10 seems to complain about array bounds in situations where gcc 11
does not - curious.

This unfortunately requires adding some casts for now; we may
investigate getting rid of our __u64 _data[] VLA in a future patch so
that our start[0] members can be VLAs.

Reported-by: John Stoffel <john@stoffel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# be9e782d 27-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't downgrade locks on transaction restart

We should only be downgrading locks on success - otherwise, our
transaction restarts won't be getting the correct locks and we'll
livelock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 523f33ef 22-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: All triggers are BTREE_TRIGGER_WANTS_OLD_AND_NEW

Upcoming rebalance_work btree will require extent triggers to be
BTREE_TRIGGER_WANTS_OLD_AND_NEW - so to reduce potential confusion,
let's just make all triggers BTREE_TRIGGER_WANTS_OLD_AND_NEW.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 50a38ca1 19-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree_node_type enum

More forwards compatibility fixups: having BKEY_TYPE_btree at the end of
the enum conflicts with unknown btree IDs; this shifts BKEY_TYPE_btree
to slot 0 and fixes things up accordingly.
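
A rough sketch of the layout change, with hypothetical names and a
made-up two-entry btree ID list: the "id + 1" mapping shown here is an
assumption about how leaf types line up after the shift, not a copy of
the real definitions.

```c
#include <assert.h>

/* Hypothetical btree IDs known to this version of the code. */
enum example_btree_id {
	EXAMPLE_BTREE_ID_extents	= 0,
	EXAMPLE_BTREE_ID_inodes		= 1,
	EXAMPLE_BTREE_ID_NR
};

/*
 * Slot 0 is reserved for interior nodes; leaf node types follow at
 * btree ID + 1. An ID this version does not know about still maps to a
 * value that cannot collide with the interior node type.
 */
enum example_node_type {
	EXAMPLE_BKEY_TYPE_btree		= 0,
	EXAMPLE_BKEY_TYPE_extents	= EXAMPLE_BTREE_ID_extents + 1,
	EXAMPLE_BKEY_TYPE_inodes	= EXAMPLE_BTREE_ID_inodes + 1,
};

/* Interior nodes (level > 0) share one type; leaves are typed per btree. */
static inline unsigned example_node_type_of(unsigned level, unsigned btree_id)
{
	return level ? EXAMPLE_BKEY_TYPE_btree : btree_id + 1;
}
```

Had the interior type been placed last (at EXAMPLE_BTREE_ID_NR in this
sketch), the first ID added by a newer version would have landed on the
same value.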

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 88dfe193 19-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_id_str()

Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
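
The kind of helper this implies is a bounds-checked lookup. The sketch
below uses a hypothetical name table and a hypothetical
example_btree_id_str(); the fallback string is an assumption and the
real helper may differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name table for the btree IDs this version knows about. */
static const char * const example_btree_names[] = {
	"extents",
	"inodes",
	"dirents",
};

#define EXAMPLE_BTREE_ID_NR \
	(sizeof(example_btree_names) / sizeof(example_btree_names[0]))

/* Never index the table with an ID that may come from a newer version. */
static inline const char *example_btree_id_str(unsigned id)
{
	return id < EXAMPLE_BTREE_ID_NR
		? example_btree_names[id]
		: "(unknown)";
}
```

Callers that previously indexed the table directly go through the helper
instead, so an out-of-range ID produces a placeholder string rather than
an out-of-bounds read.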

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b560e32e 23-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Always check for invalid bkeys in main commit path

Previously, we would check for invalid bkeys at transaction commit time,
but only if CONFIG_BCACHEFS_DEBUG=y.

This check is important enough to always be on - it appears there's been
corruption making it into the journal that would have been caught by it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cc07773f 22-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Put bkey invalid check in commit path in a more useful place

When doing updates early in recovery, before we can go RW, we still want
to check that keys are valid at commit time - this moves key invalid
checking to before the "btree updates to journal" path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4491283f 22-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a double free on invalid bkey

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# da525760 21-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree write buffer with snapshots btrees

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8e877caa 16-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Split out snapshot.c

subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 401585fe 05-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_journal_iter.c

Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8079aab0 04-Aug-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Split up btree_update_leaf.c

We now have:
  btree_trans_commit.c
  btree_update.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>