History log of /linux-master/fs/bcachefs/btree_types.h
Revision Date Author Comments
# 6e4d9bd1 20-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong

This stores the SRCU sequence number, which we use to check whether an
SRCU barrier has elapsed; this is a partial fix for the key cache
shrinker not actually freeing anything.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
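
A minimal sketch of the comparison this enables, using the kernel's SRCU
polling API (start_poll_synchronize_srcu() and poll_state_synchronize_srcu()
are real interfaces that traffic in unsigned long cookies; the structure and
helpers are illustrative assumptions, not the bcachefs code):

  #include <linux/srcu.h>

  /* illustrative: SRCU grace-period cookies are unsigned long, so the
   * field that stores one must be a ulong or the comparison truncates */
  struct bkey_cached_ex {
      unsigned long btree_trans_barrier_seq;
  };

  static void ex_mark_pending_free(struct srcu_struct *ssp,
                                   struct bkey_cached_ex *ck)
  {
      /* snapshot the current SRCU grace period when the key is freed */
      ck->btree_trans_barrier_seq = start_poll_synchronize_srcu(ssp);
  }

  static bool ex_barrier_elapsed(struct srcu_struct *ssp,
                                 struct bkey_cached_ex *ck)
  {
      /* true once a full SRCU barrier has elapsed since then */
      return poll_state_synchronize_srcu(ssp, ck->btree_trans_barrier_seq);
  }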


# be42e4a6 04-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Bump limit in btree_trans_too_many_iters()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 91dcad18 22-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Pin btree cache in ram for random access in fsck

Various phases of fsck involve checking references from one btree to
another: this means doing a sequential scan of one btree, and then
mostly random access into the second.

This is particularly painful for checking extents <-> backpointers; we
can prefetch btree node access on the sequential scan, but not on the
random access portion, which hurts most on spinning rust, where we'd
like to keep the pipeline fairly full of btree node reads so that the
elevator can reduce seeking.

This patch implements prefetching and pinning of the portion of the
btree that we'll be doing random access to. We already calculate how
much of the random access btree will fit in memory so it's a fairly
straightforward change.

This will put more pressure on system memory usage, so we introduce a
new option, fsck_memory_usage_percent, which is the percentage of total
system RAM that fsck is allowed to pin.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
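
A hedged sketch of how a percent-of-total-RAM option can be turned into a
byte budget for pinning (totalram_pages(), PAGE_SHIFT and div_u64() are real
kernel symbols; the function itself is an illustrative assumption, not the
actual bcachefs option plumbing):

  #include <linux/mm.h>
  #include <linux/math64.h>

  static u64 ex_fsck_pin_budget_bytes(unsigned fsck_memory_usage_percent)
  {
      u64 total = (u64) totalram_pages() << PAGE_SHIFT;

      /* fraction of total system RAM that fsck may pin */
      return div_u64(total * fsck_memory_usage_percent, 100);
  }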


# b26d7914 21-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_ID_subvolume_children

Add a btree to record parent -> child subvolume relationships,
according to the filesystem hierarchy.

The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.

This will be used for efficiently listing subvolumes, as well as
recursive deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
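
A sketch of the bitset encoding described above; POS() is the usual bcachefs
position constructor, while the helper itself is hypothetical:

  /* a set bit at pos p means p.offset is a child of subvolume p.inode */
  static struct bpos ex_subvolume_child_pos(u32 parent_subvol, u32 child_subvol)
  {
      return POS(parent_subvol, child_subvol);
  }

  /* "is child a child of parent?" then reduces to "does a key exist at
   * ex_subvolume_child_pos(parent, child) in BTREE_ID_subvolume_children?" */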


# 52946d82 06-Feb-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill more -EIO error codes

This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b97de453 15-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve trace_trans_restart_relock

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8e7834a8 16-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch_fs_usage_base

Split out base filesystem usage into its own type; prep work for
breaking up bch2_trans_fs_usage_apply().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 38c23fb8 07-Jan-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_TRIGGER_ATOMIC

Add a new flag to be explicit about when we're running atomic triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0c99e17d 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: growable btree_paths

XXX: we're allocating memory with btree locks held - bad

We need to plumb through an error path so we can do
allocate_dropping_locks() - but we're merging this now because it fixes
a transaction path overflow caused by indirect extent fragmentation, and
the resize path is rare.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2c3b0fc3 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans->nr_paths

Start to plumb through dynamically growable btree_paths; this patch
replaces most BTREE_ITER_MAX references with trans->nr_paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5cc6daf7 12-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans->updates will also be resizable

the reflink triggers are also bumping up against the maximum number of
paths in a transaction - and generating proportional numbers of updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 31403dca 11-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS

- Some tweaks to greatly reduce locking overhead for the list of btree
transactions, so that it can always be enabled: leave btree_trans
objects on the list when they're on the percpu single item freelist,
and only check for duplicates in the same process when
CONFIG_BCACHEFS_DEBUG is enabled

- don't zero out the full btree_trans() unless we allocated it from
the mempool

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fea153a8 12-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: rcu protect trans->paths

Upcoming patches are going to be changing trans->paths to a
reallocatable buffer. We need to guard against use after free when it's
used by other threads; this introduces RCU protection to those paths and
changes them to check for trans->paths == NULL

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
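
A hedged sketch of the general pattern being introduced, an RCU-protected
buffer that can be swapped for a larger one while readers are active (only
the RCU and allocation primitives are real kernel APIs; the structure and
helpers are illustrative, not the bcachefs code):

  #include <linux/rcupdate.h>
  #include <linux/slab.h>
  #include <linux/string.h>

  struct ex_path { int level; };

  struct ex_trans_paths {
      size_t                nr;
      struct ex_path __rcu *paths;
  };

  static struct ex_path *ex_paths_deref(struct ex_trans_paths *t)
  {
      /* readers: must tolerate seeing NULL while the buffer is swapped */
      return rcu_dereference(t->paths);
  }

  static int ex_paths_grow(struct ex_trans_paths *t, size_t new_nr)
  {
      struct ex_path *old = rcu_dereference_protected(t->paths, true);
      struct ex_path *new = kcalloc(new_nr, sizeof(*new), GFP_KERNEL);

      if (!new)
          return -ENOMEM;
      if (old)
          memcpy(new, old, t->nr * sizeof(*new));

      rcu_assign_pointer(t->paths, new);
      t->nr = new_nr;

      if (old) {
          synchronize_rcu();  /* wait out readers of the old buffer */
          kfree(old);
      }
      return 0;
  }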


# 6474b706 11-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Clean up btree_trans

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 398c9834 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: kill btree_path.idx

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7f9821a7 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_insert_entry -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 07f383c7 03-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_iter -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 255ebbbf 08-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_path_get() -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 67997234 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: kill btree_trans->wb_updates

the btree write buffer path now creates a journal entry directly

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 24de63da 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve trans->extra_journal_entries

Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.

This is prep work for converting write buffer updates to use this
mechanism.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a83b6c89 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: kill btree_path->(alloc_seq|downgrade_seq)

These were for extra info in tracepoints for debugging a specialized
issue - we do not want to bloat btree_path for this, at least in release
builds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6e92d155 03-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Refactor trans->paths_allocated to be standard bitmap

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ad9c7992 17-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill btree_iter->journal_pos

For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to
avoid the binary search overhead.

Previously we stashed the pos of the last key returned from the journal,
in order to force the lookup to be redone when rewinding.

Now bch2_journal_keys_peek_upto() handles rewinding itself when
necessary - so we can slim down btree_iter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1ae8a090 16-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill memset() in bch2_btree_iter_init()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e56978c8 12-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_ITER_ALL_LEVELS

As discussed in the previous patch, BTREE_ITER_ALL_LEVELS appears to be
racy with concurrent interior node updates - and perhaps it is fixable,
but it's tricky and unnecessary.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 43c7ede0 08-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_UPDATE_PREJOURNAL

With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
now switch the btree write buffer to use it for flushing.

This has the advantage that transaction commits don't need to take a
journal reservation at all.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7125063f 02-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't decrease BTREE_ITER_MAX when LOCKDEP=y

Running with fewer max btree paths doesn't work anymore when replication
is enabled - as we've added e.g. the freespace and bucket gens btrees,
we naturally end up needing more btree paths.

This is an issue with lockdep: we end up taking more locks than lockdep
will track (the MAX_LOCK_DEPTH constant). But bcachefs as merged does
not yet support lockdep anyway, so we can leave that for later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 006ccc30 04-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill journal pre-reservations

This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4 full we only allow
journal reclaim to get new journal reservations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
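
A hedged sketch of the watermark scheme that replaces pre-reservations; the
thresholds mirror the description above, but the enum and helper names are
illustrative:

  enum ex_journal_watermark {
      EX_WATERMARK_any,           /* < 1/2 full: normal operation */
      EX_WATERMARK_aggressive,    /* > 1/2 full: reclaim runs harder */
      EX_WATERMARK_reclaim_only,  /* > 3/4 full: only reclaim may reserve */
  };

  static enum ex_journal_watermark ex_journal_watermark(u64 used, u64 total)
  {
      if (used * 4 > total * 3)
          return EX_WATERMARK_reclaim_only;
      if (used * 2 > total)
          return EX_WATERMARK_aggressive;
      return EX_WATERMARK_any;
  }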


# 3b8c4507 06-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_trans->write_locked

As prep work for the next patch to fix a key cache reclaim issue, we
need to start tracking whether we're currently holding write locks - so
that we can release and retake them before calling into memory reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1bd5bcc9 13-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Split out btree_key_cache_types.h

More consistent organization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9fcdd23b 04-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usage

BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization
related to inode updates and fsync - document it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d3c7727b 03-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: rebalance_work btree is not a snapshots btree

rebalance_work entries may refer to entries in the extents btree, which
is a snapshots btree, or they may also refer to entries in the reflink
btree, which is not.

Hence rebalance_work keys may use the snapshot field but it's not
required to be nonzero - add a new btree flag to reflect this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c4accde4 29-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Ensure srcu lock is not held too long

The SRCU read lock that btree_trans takes exists to make it safe for
bch2_trans_relock() to deref pointers to btree nodes/key cache items we
don't have locked, but as a side effect it blocks reclaim from freeing
those items.

Thus, it's important to not hold it for too long: we need to
differentiate between bch2_trans_unlock() calls that will be only for a
short duration, and ones that will be for an unbounded duration.

This introduces bch2_trans_unlock_long(), to be used mainly by the data
move paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# be9e782d 27-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't downgrade locks on transaction restart

We should only be downgrading locks on success - otherwise, our
transaction restarts won't be getting the correct locks and we'll
livelock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 50a38ca1 19-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree_node_type enum

More forwards compatibility fixups: having BKEY_TYPE_btree at the end of
the enum conflicts with unknown btree IDs; this shifts BKEY_TYPE_btree
to slot 0 and fixes things up accordingly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ecae0bd5 02-Nov-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:

- Kemeng Shi has contributed some compaction maintenance work in the
series 'Fixes and cleanups to compaction'

- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested

- More DAMON/DAMOS maintenance and feature work from SeongJae Park in
the following patch series:

mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature, to increase the feature's checking coverage: 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'

- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code

- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'

- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'

- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'

- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'

- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames

- The series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmemmap optimization code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use

- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code

- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'

- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2

- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'

- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says

- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes it possible for a process to propagate KSM treatment
across exec()

- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'

- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans

- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'

- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU

- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code

- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result

- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions

- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements

- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things

- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'

- In the series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults

- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code

- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'

- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'

- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'

- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'

- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'

- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'

- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'

- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark

- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'

- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'

- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'

- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in

https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

with help from Qi Zheng.

The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...


# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocation to initialize it anyway, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e8d2fe3b 21-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Consolidate btree id properties

This refactoring centralizes defining per-btree properties.

bch2_key_types_allowed was also about to overflow a u32, so expand that
to a u64.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# eabb10dc 19-Jul-2023 Brian Foster <bfoster@redhat.com>

bcachefs: support btree updates of prejournaled keys

Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.

Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.

Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 73bd774d 06-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted sparse fixes

- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 95b595a5 30-Apr-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree iterator, update flags no longer conflict

Change btree_update_flags to start after the last btree iterator flag,
so that we can pass both in the same flags argument.

This is needed for the upcoming bch2_bkey_get_mut() helper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
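
A hedged sketch of the resulting flag layout: update flags now begin after
the highest iterator flag bit, so both kinds can travel in a single flags
argument (names and bit positions here are made up):

  enum ex_btree_iter_flags {
      EX_ITER_SLOTS      = 1 << 0,
      EX_ITER_INTENT     = 1 << 1,
      EX_ITER_FLAGS_BITS = 8,         /* bits reserved for iterator flags */
  };

  enum ex_btree_update_flags {
      /* update flags start after the last iterator flag bit ... */
      EX_UPDATE_NOJOURNAL              = 1 << (EX_ITER_FLAGS_BITS + 0),
      EX_UPDATE_INTERNAL_SNAPSHOT_NODE = 1 << (EX_ITER_FLAGS_BITS + 1),
  };

  /* ... so one unsigned 'flags' argument can carry both kinds, e.g.
   * EX_ITER_INTENT|EX_UPDATE_NOJOURNAL, without collisions */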

# f3a65bb9 27-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert constants to consts

Rust bindgen doesn't handle macros, but it does handle integer
constants: this conversion aids in implementing safe Rust wrapper
interfaces.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
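
A hedged illustration of the kind of conversion meant (names and values are
made up): bindgen cannot see a #define'd constant, but it can translate an
enum or a typed const:

  /* before: a macro, invisible to Rust bindgen */
  /* #define EX_SOME_LIMIT 64 */

  /* after: real C objects that bindgen can translate */
  enum { EX_SOME_LIMIT = 64 };
  static const unsigned EX_OTHER_LIMIT = 32;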

# 5e2d8be8 18-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Split trans->last_begin_ip and trans->last_restarted_ip

These are two different things - this improves our debug assert
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 930c0c4c 11-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add missing include

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 920e69bc 03-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree write buffer

This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.

This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.

This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.

Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.

- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path.

- Transaction commit path changes:
The update to the btree write buffer both mutates global state and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction at once, and if it fails we have to revert
the filesystem usage counter changes.

If there isn't room we flush the write buffer in the transaction
commit error path and retry.

- A new persistent option, for specifying the number of entries in the
write buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 30ca6ece 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill trans->flags

Recursive transaction commits are occasionally necessary - in
particular, for the upcoming btree write buffer's flush path.

This avoids bugs due to trans->flags being accidentally mutated
mid-commit, which can cause c->writes refcount leaks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 60b55388 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans->notrace_relock_fail

When we unlock in order to submit IO, the next relock event is likely to
fail if submit_bio() blocked - we shouldn't count those events in our _fail
stats, since those are expected events and shouldn't cause test
failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 94c69faf 04-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Use six_lock_ip()

This uses the new _ip() interface to six locks and hooks it up to
btree_path->ip_allocated, when available.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 464b4155 05-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_trans_reset_updates()

This should have been resetting trans->fs_usage_deltas as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ee2c6ea7 08-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_iter->ip_allocated

In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5bbe3f2d 14-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Log more messages in the journal

This patch

- Adds a mechanism for queuing up journal entries prior to the journal
being started, which will be used for early journal log messages

- Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
take format strings. bch2_fs_log_msg() can be used before or after
the journal has been started, and will use the appropriate mechanism.

- Deletes the now obsolete bch2_journal_log_msg()

- And adds more log messages to the recovery path - messages for
journal/filesystem started, journal entries being blacklisted, and
journal replay starting/finishing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e242b92a 15-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix for long running btree transactions & key cache

While a btree transaction is running, we hold a SRCU read lock on the
btree key cache that prevents btree key cache keys from being freed -
this is so that relock() operations won't access freed memory.

The downside of this is that long running btree transactions prevent
memory from being freed from the key cache. This adds a check in
bch2_trans_begin() - if the transaction has been running longer than 1
second, drop and retake the SRCU read lock and zero out pointers to
unlock key cache paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
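
A hedged sketch of the check described, not the actual bch2_trans_begin()
code (srcu_read_lock()/srcu_read_unlock(), jiffies and time_after() are real
kernel APIs; the structure and field names are assumptions):

  #include <linux/srcu.h>
  #include <linux/jiffies.h>

  struct ex_btree_trans {
      int           srcu_idx;
      unsigned long srcu_lock_time;
  };

  static void ex_maybe_reset_srcu(struct ex_btree_trans *trans,
                                  struct srcu_struct *ssp)
  {
      if (time_after(jiffies, trans->srcu_lock_time + HZ)) {
          /* ... zero out pointers into unlocked key cache paths here ... */
          srcu_read_unlock(ssp, trans->srcu_idx);
          trans->srcu_idx       = srcu_read_lock(ssp);
          trans->srcu_lock_time = jiffies;
      }
  }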

# a16b19cd 02-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Allow for more btrees

Expand some bitfields so we can keep adding more btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ac9fa4bd 07-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill btree_insert_ret enum

Replace with standard bcachefs-private error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1617d56d 22-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Key cache now works for snapshots btrees

This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.

We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.

This will allow us to re-enable the key cache for inodes in the next
patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 087e53c2 20-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Bring back BTREE_ITER_CACHED_NOFILL

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c96f108b 24-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Optimize bch2_trans_iter_init()

When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
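
A hedged sketch of the constant-folding idea (the flag names, values and
helper are illustrative, not the real bch2_trans_iter_init()): when the
btree id and flags are compile-time constants, the whole flag computation
folds away and the wrapper becomes small enough to inline:

  enum { EX_BTREE_ID_extents = 2 };
  enum {
      EX_ITER_IS_EXTENTS       = 1 << 0,
      EX_ITER_ALL_SNAPSHOTS    = 1 << 1,
      EX_ITER_FILTER_SNAPSHOTS = 1 << 2,
  };

  static inline unsigned ex_iter_flags(unsigned btree_id, unsigned flags)
  {
      /* with constant btree_id and flags this folds to a constant */
      if (btree_id == EX_BTREE_ID_extents)
          flags |= EX_ITER_IS_EXTENTS;
      if (!(flags & EX_ITER_ALL_SNAPSHOTS))
          flags |= EX_ITER_FILTER_SNAPSHOTS;
      return flags;
  }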

# 42af0ad5 17-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a race with b->write_type

b->write_type needs to be set atomically with setting the
btree_node_need_write flag, so move it into b->flags.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
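
A hedged sketch of why folding the write type into the flags word helps:
the need_write bit and the type bits can then be updated together in one
atomic cmpxchg, so they are never observed out of sync (bit positions and
names below are illustrative; READ_ONCE() and try_cmpxchg() are real kernel
primitives):

  #define EX_NODE_need_write   0
  #define EX_WRITE_TYPE_SHIFT  1
  #define EX_WRITE_TYPE_MASK   (3UL << EX_WRITE_TYPE_SHIFT)

  static void ex_set_need_write(unsigned long *flags, unsigned long type)
  {
      unsigned long old = READ_ONCE(*flags), new;

      do {
          new  = old & ~EX_WRITE_TYPE_MASK;
          new |= (type << EX_WRITE_TYPE_SHIFT) & EX_WRITE_TYPE_MASK;
          new |= 1UL << EX_NODE_need_write;
      } while (!try_cmpxchg(flags, &old, new));
  }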

# 46fee692 28-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improved btree write statistics

This replaces sysfs btree_avg_write_size with btree_write_stats, which
now breaks out statistics by the source of the btree write.

Btree writes that are too small are a source of inefficiency, and
excessive btree resort overhead - this will let us see what's causing
them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# fd0c7679 22-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert to __packed and __aligned

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 77671e8f 21-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Move bkey bkey_unpack_key() to bkey.h

Long ago, bkey_unpack_key() was added to bset.h instead of bkey.h
because bkey.h didn't include btree_types.h, which it needs for the
compiled unpack function.

This patch finally moves it to the proper location.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0d7009d7 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete old deadlock avoidance code

This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 33bd5d06 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Deadlock cycle detector

We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occurred. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totalling 20% of actual transaction commits.

- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.

- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourselves to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
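
A hedged sketch of the idea, reduced to a single wait-for pointer per
transaction and Floyd cycle detection; the real detector does a DFS, since a
transaction can hold many locks, each with many waiters:

  /* illustrative only: not the bcachefs lock graph code */
  struct ex_trans {
      struct ex_trans *blocked_on; /* holder of the lock we're waiting for */
  };

  static bool ex_would_deadlock(struct ex_trans *self)
  {
      struct ex_trans *slow = self, *fast = self;

      /* if following blocked_on pointers ever loops, some set of
       * transactions is deadlocked and one of them must restart */
      while (fast && fast->blocked_on) {
          slow = slow->blocked_on;
          fast = fast->blocked_on->blocked_on;
          if (slow == fast)
              return true;
      }
      return false;
  }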

# 3d21d48e 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix usage of six lock's percpu mode, key cache version

Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.

Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segregate percpu and non percpu objects, including when
on freelists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4e6defd1 31-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_bkey_cached_common->cached

Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 616928c3 22-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Track maximum transaction memory

This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in

This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2e27f656 21-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill nodes_intent_locked

Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
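
A hedged sketch of the representation: two bits of lock state per btree
level, packed into a single word (the encoding and helpers are illustrative):

  enum ex_node_lock {
      EX_LOCK_UNLOCKED = 0,
      EX_LOCK_READ     = 1,
      EX_LOCK_INTENT   = 2,
      EX_LOCK_WRITE    = 3,    /* trackable later, as the commit notes */
  };

  static inline unsigned ex_get_lock(u64 locks, unsigned level)
  {
      return (locks >> (level * 2)) & 3;
  }

  static inline u64 ex_set_lock(u64 locks, unsigned level, enum ex_node_lock t)
  {
      locks &= ~((u64) 3 << (level * 2));
      return locks | ((u64) t << (level * 2));
  }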

# cd5afabe 19-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_locking.c

Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5c0bb66a 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Track the maximum btree_paths ever allocated by each transaction

We need a way to check if the machinery for handling btree_paths within
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.

This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4aba7d45 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rename lock_held_stats -> btree_transaction_stats

Going to be adding more things to this in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6fae65c1 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_CACHED_(NOFILL|NOCREATE)

These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 315c9ba6 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NO_NODE -> BCH_ERR codes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 86b74451 05-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix bch2_btree_trans_to_text()

bch2_btree_trans_to_text() is used to print btree_transactions owned by
other threads; thus, it needs to be particularly careful. This fixes a
null ptr deref caused by racing with the owning thread changing
path->l[].b.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 549d173c 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: EINTR -> BCH_ERR_transaction_restart

Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# e941ae7d 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a counter for btree_trans restarts

This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.

This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 89333156 16-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert bch2_dev_usrdata_drop() to for_each_btree_key2()

The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c807ca95 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: added lock held time stats

We now record the length of time btree locks are held and expose this in debugfs.

Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 43de721a 13-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Unlock in bch2_trans_begin() if we've held locks more than 10us

We try to ensure we never hold btree locks for too long - bcachefs tries
to be soft realtime. This adds a check when restarting a transaction,
where a transaction restart is cheap - if we've been holding locks for
too long, drop and retake them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 8f7f566f 16-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache pcpu freedlist

Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.

But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.

This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 50b13bee 17-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve an error message

When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b7c11046 19-May-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Increase max size for btree_trans bump allocator

With backpointers, alloc keys have gotten bigger, so we're needing more
memory here.

We're probably going to need to go with something more sophisticated
than a bump allocator, but - let's see if we can avoid doing that just
yet.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 30525f68 21-May-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix journal_keys_search() overhead

Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.

This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
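
A hedged sketch of the memoization described, over a sorted array keyed by
a plain u64 for brevity (structure and helper are illustrative, not the
bcachefs journal keys code):

  struct ex_journal_keys {
      u64    *keys;       /* sorted ascending */
      size_t  nr;
      size_t  cached_idx; /* last returned position */
  };

  /* returns the index of the last key <= search (0 if there is none) */
  static size_t ex_jkeys_search(struct ex_journal_keys *k, u64 search)
  {
      size_t i = k->cached_idx;

      if (i < k->nr && k->keys[i] <= search) {
          /* fast path: resume scanning forward from the saved position
           * (a real implementation caps this and falls back to a binary
           * search on a large forward jump) */
          while (i + 1 < k->nr && k->keys[i + 1] <= search)
              i++;
      } else {
          /* slow path: position moved backwards, redo the binary search */
          size_t l = 0, r = k->nr;

          while (l < r) {
              size_t m = l + (r - l) / 2;

              if (k->keys[m] <= search)
                  l = m + 1;
              else
                  r = m;
          }
          i = l ? l - 1 : 0;
      }

      k->cached_idx = i;
      return i;
  }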

# b0babf2a 12-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_btree_iter_peek_all_levels()

This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.

- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS

- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set bch2_btree_iter_advanced() just marks
the iterator as advanced.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e1b8f5f5 31-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Plumb btree_id & level to trans_mark

For backpointers, we'll need the full key location - that means btree_id
and btree level. This patch plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c6b2826c 11-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Freespace, need_discard btrees

This adds two new btrees for the upcoming allocator rewrite: an extents
btree of free buckets, and a btree for buckets awaiting discards.

We also add a new trigger for alloc keys to keep the new btrees up to
date, and a compatibility path to initialize them on existing
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3d48a7f8 31-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: KEY_TYPE_alloc_v4

This introduces a new alloc key which doesn't use varints. Soon we'll be
adding backpointers and storing them in alloc keys, which means our
pack/unpack workflow for alloc keys won't really work - we'll need to be
mutating alloc keys in place.

Instead of bch2_alloc_unpack(), we now have bch2_alloc_to_v4() that
converts older types of alloc keys to v4 if needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2a6870ad 29-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use darray for extra_journal_entries

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 3a306f3c 17-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix large key cache keys

Previously, we'd go into an infinite loop when attempting to cache a
bkey in the key cache larger than 128 u64s - since we were only using a
u8 for the size field, it'd get rounded up to 256 then truncated to 0.
Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
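
A small standalone C demonstration of the failure mode: round a >128 u64
key size up to the next power of two and store it in a u8, and the stored
size becomes 0 (the variable names are made up):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      unsigned need_u64s = 129;       /* key larger than 128 u64s */
      unsigned rounded   = 256;       /* rounded up to the next power of two */
      uint8_t  u64s_u8   = rounded;   /* the old 8-bit size field */

      /* prints need=129 stored=0: the allocation looks zero-sized, so the
       * key never fits and the fill path loops forever */
      printf("need=%u stored=%u\n", need_u64s, (unsigned) u64s_u8);
      return 0;
  }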

# 30985537 04-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix usage of six lock's percpu mode

Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segregating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# bf3efff5 27-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix race leading to btree node write getting stuck

Checking btree_node_may_write() isn't atomic with the other btree flags,
dirty and need_write in particular. There was a rare race where we'd
unblock a node from writing while __btree_node_flush() was setting
need_write, and no thread would notice that the node was now both able
to write and needed to be written.

Fix this by adding btree node flags for will_make_reachable and
write_blocked that can be checked in the cmpxchg loop in
__bch2_btree_node_write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# de517c95 26-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use x-macros for btree node flags

This is for adding an array of strings for btree node flag names.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 8322a937 04-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache optimization

This helps with lock contention in the journalling code: instead of
updating our journal pin on every write, only get a journal pin if we
don't have one.

This means we can avoid hammering on journal locks nearly so much, at
the cost of carrying around a journal pin for an older entry than the
one we actually need. To handle that, if needed we update our journal
pin to the correct one when flushed by journal reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8f9ad91a 17-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix failure to allocate btree node in cache

The error code when we fail to allocate a node in the btree node cache
doesn't make it to bch2_btree_path_traverse_all(). Instead, we need to
stash a flag in btree_trans so we know we have to take the cannibalize
lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c7ce2732 15-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Also show when blocked on write locks

This consolidates some of the btree node lock path, so that when we're
blocked taking a write lock on a node it shows up in
bch2_btree_trans_to_text(), along with intent and read locks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 12ce5b7d 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache coherency

- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.

- Except when creating new keys: then the update goes to underlying
btree

For iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.

Otherwise, for consistency, all updates should go to the same place -
the key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f7b6ca23 06-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_KEY_CACHE

This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).

Note that we could still race with another thread creating an item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2e63e180 24-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Stash a copy of key being overwritten in btree_insert_entry

We currently need to call bch2_btree_path_peek_slot() multiple times in
the transaction commit path - and some of those need to be updated to
also check the keys from journal replay, too. Let's consolidate this and
stash the key being overwritten in btree_insert_entry.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 1f2d9192 08-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: iter->update_path

With BTREE_ITER_FILTER_SNAPSHOTS, we have to distinguish between the
path where the key was found, and the path for inserting into the
current snapshot. This adds a new field to struct btree_iter for saving
a path for the current snapshot, and plumbs it through
bch2_trans_update().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 669f87a5 03-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Switch to __func__for recording where btree_trans was initialized

Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5222a460 25-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_JOURNAL

This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.

This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.

Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# fb64f3fd 31-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BCH_JSET_ENTRY_log

Add a journal entry type for logging messages, and add an option to use
it to log the transaction name - this makes for a very handy debugging
tool, as with it we can use the 'bcachefs list_journal' command to see
not only what updates were done, but what was doing them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# f3e1f444 21-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NOPRESERVE

This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# b84d42c3 16-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out CONFIG_BCACHEFS_DEBUG_TRANSACTIONS

This puts the btree_transactions sysfs/debugfs file behind a separate
config option - it's highly useful, but not cheap enough to enable
permanently.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f0c3f88b 26-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Run insert triggers before overwrite triggers

Currently, btree triggers are run in natural key order, which presents a
problem for fallocate in INSERT_RANGE mode: since we're moving existing
extents to higher offsets, the trigger for deleting the old extent runs
before the trigger that adds the new extent, potentially leading to
indirect extents being deleted that shouldn't be when the delete causes
the refcount to hit 0.

This changes the order we run triggers so that for a given btree, we run
all insert triggers before overwrite triggers, nicely sidestepping this
issue.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 3e52c222 29-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add journal_seq to inode & alloc keys

Add fields to inode & alloc keys that record the journal sequence number
when they were most recently modified.

For alloc keys, this is needed to know what journal sequence number we
have to flush before the bucket can be reused. Currently this is tracked
in memory, but we'll be getting rid of the in memory bucket array.

For inodes, this is needed for fsync when the inode has been evicted
from the vfs cache. Currently we use a bloom filter per outstanding
journal buf - but that mechanism has been broken since we added the
ability to not issue a flush/fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 34d74830 30-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_UPDATE_NOJOURNAL

We're going to have btree updates that don't need to be journalled; add
a flag for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 9a796fdb 19-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_exit() no longer returns errors

Now that peek_node()/next_node() are converted to return errors
directly, we don't need bch2_trans_exit() to return errors - it's
cleaner this way and wasn't used much anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c075ff70 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_FILTER_SNAPSHOTS

For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 14b393ee 15-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Subvolumes, snapshots

This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 3074bc0f 15-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

Revert "bcachefs: Add more assertions for locking btree iterators out of order"

Figured out the bug we were chasing, and it had nothing to do with
locking btree iterators/paths out of order.

This reverts commit ff08733dd298c969aec7c7828095458f73fd5374.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 068bcaa5 03-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add more assertions for locking btree iterators out of order

btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 67e0dd8f 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_path

This splits btree_iter into two components: btree_iter is now the
externally visible component, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f21566f1 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NODES

We really only need to distinguish between btree iterators and btree key
cache iterators - this is more prep work for btree_path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# deb0e573 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NEED_PEEK

This was used for an optimization that hasn't existed in quite a while
- iter->uptodate will probably be going away as well.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 6fba6b83 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Prefer using btree_insert_entry to btree_iter

This moves some data dependencies forward, to improve pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5f8077cc 29-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_SET_POS_AFTER_COMMIT

BTREE_ITER_SET_POS_AFTER_COMMIT is used internally to automagically
advance extent btree iterators on successful commit.

But with the upcoming btree_path patch it's getting more awkward to
support; it adds overhead to core data structures that's only used in a
few places, and can easily be done by the caller instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 84841b0d 24-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_dump_trans_iters_updates()

This factors out bch2_dump_trans_iters_updates() from the iter alloc
overflow path, and makes some small improvements to what it prints.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 0423fb71 12-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Keep a sorted list of btree iterators

This will be used to make other operations on btree iterators within a
transaction more efficient, and enable some other improvements to how we
manage btree iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e5af273f 25-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans->restarted

Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 9f1833ca 10-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Update btree ptrs after every write

This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# b00fde8f 05-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

Add a new flag to control assertions about updating to internal snapshot
nodes, that normally should not be written to - to be used in an
upcoming patch.

Also do some renaming - trigger_flags is now update_flags.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 297d8934 10-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Extensive triggers cleanups

- We no longer mark subsets of extents, they're marked like regular
keys now - which means we can drop the offset & sectors arguments
to trigger functions
- Drop other arguments that are no longer needed in various places -
fs_usage
- Drop the logic for handling extents in bch2_mark_update() that isn't
needed anymore, to match bch2_trans_mark_update()
- Better logic for handling the BTREE_ITER_CACHED_NOFILL case, where we
don't have an old key to mark

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# a49e9a05 12-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix null ptr deref when splitting compressed extents

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# cd8319fd 07-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill trans->updates2

Now that extent handling has been lifted to bch2_trans_update(), we
don't need to keep two different lists of updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5288e66a 03-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_UPDATES

This drops bch2_btree_iter_peek_with_updates() and replaces it with a
new flag, BTREE_ITER_WITH_UPDATES, and also reworks
bch2_btree_iter_peek_slot() to respect it too.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 509d3e0a 20-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Child btree iterators

This adds the ability for btree iterators to own child iterators - to be
used by an upcoming rework of bch2_btree_iter_peek_slot(), so we can
scan forwards while maintaining our current position.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 66a0a497 04-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter->should_be_locked

Add a field to struct btree_iter for tracking whether it should be
locked - this fixes spurious transaction restarts in
bch2_trans_relock().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 531a0095 04-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree iterator tracepoints

This patch adds some new tracepoints to the btree iterator code, and
adds new fields to the existing tracepoints - primarily for the iterator
position.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# d7f35163 09-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix BTREE_ITER_NOT_EXTENTS

bch2_btree_iter_peek() wasn't properly checking for
BTREE_ITER_IS_EXTENTS when updating iter->pos.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3ce8b463 03-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill bset_tree->max_key

Since we now ensure a btree node's max key fits in its packed format,
this isn't needed for the reasons it used to be - and, it was being used
inconsistently.

Also reorder struct btree a bit for performance, and kill some dead
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5c1d808a 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Drop trans->nounlock

Since we're no longer doing btree node merging post commit, we can now
delete a bunch of code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e751c01a 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Start using bpos.snapshot field

This patch starts treating the bpos.snapshot field like part of the key
in the btree code:

* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
and xattrs) now always have their snapshot field set to U32_MAX

The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.

We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and always U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).

This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
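
To make the successor behaviour concrete, here's a userspace toy (made-up
types and helpers, not the real bkey code): with the all-snapshots
semantics the snapshot field participates in ordering as the least
significant part of the position, so the successor increments it first;
without it, only the inode:offset part of the position advances.

    #include <stdint.h>
    #include <stdio.h>

    struct toy_bpos {
        uint64_t inode;
        uint64_t offset;
        uint32_t snapshot;
    };

    /* successor across all snapshots: snapshot is the least significant field */
    static struct toy_bpos toy_bpos_successor(struct toy_bpos p)
    {
        if (++p.snapshot)
            return p;
        if (++p.offset)
            return p;
        ++p.inode;
        return p;
    }

    /* successor within one snapshot: leave the snapshot field alone */
    static struct toy_bpos toy_bkey_successor(struct toy_bpos p)
    {
        if (++p.offset)
            return p;
        ++p.inode;
        return p;
    }

    int main(void)
    {
        struct toy_bpos p = { .inode = 1, .offset = 5, .snapshot = 3 };
        struct toy_bpos a = toy_bpos_successor(p);
        struct toy_bpos b = toy_bkey_successor(p);

        printf("all snapshots: %llu:%llu:%u\n", (unsigned long long) a.inode,
               (unsigned long long) a.offset, (unsigned) a.snapshot);
        printf("one snapshot:  %llu:%llu:%u\n", (unsigned long long) b.inode,
               (unsigned long long) b.offset, (unsigned) b.snapshot);
        return 0;
    }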

# 43d00243 03-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mechanism for running callbacks at trans commit time

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 331194a2 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache locking improvements

The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c7bb769c 19-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: __bch2_trans_get_iter() refactoring, BTREE_ITER_NOT_EXTENTS

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1f7fdc0a 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter_live()

New helper to clean things up a bit - also, improve iter->flags
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6333bd2f 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve handling of extents in bch2_trans_update()

The transaction update/commit path cares about whether it's inserting
extents or regular keys; extents require extra passes (handling of
overlapping extents) but sometimes we want to skip all that. This
clarifies things by adding a new member to btree_insert_entry specifying
whether the key being inserted is an extent, instead of overloading
BTREE_ITER_IS_EXTENTS.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 41f8b09e 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rename BTREE_ID enums for consistency with other enums

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f2785955 19-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill support for !BTREE_NODE_NEW_EXTENT_OVERWRITE()

bcachefs has been aggressively migrating filesystems and btree nodes to
the new format for quite some time - this shouldn't affect anyone
anymore, and lets us delete a _lot_ of code. Also, it frees up
KEY_TYPE_discard for a new whiteout key type for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e131b6aa 23-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mempool for btree_trans bump allocator

This allocation is required for filesystem operations to make forward
progress, thus needs a mempool.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1889ad5a 14-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add code to scan for/rewrite old btree nodes

This adds a new data job type to scan for btree nodes in the old extent
format, and rewrite them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 7e1a3aa9 11-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: iter->real_pos

We need to differentiate between the search position of a btree
iterator, vs. what it actually points at (what we found). This matters
for extents, where iter->pos will typically be the start of the key we
found and iter->real_pos will be the end of the key we found (which soon
won't necessarily be in the same btree node!) and it will also matter
for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 07a1006a 17-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce/kill BKEY_PADDED use

With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.

In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
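
The idea is the classic small-buffer optimization; here's a userspace
sketch with made-up names (the real bkey_on_stack code differs in detail):
a small inline buffer covers the common case, and we only heap allocate
when a value is too large for it, instead of sizing a fixed buffer for the
worst possible key.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define INLINE_U64s 8

    struct key_buf {
        uint64_t *k;        /* points at onstack[] or at a heap buffer */
        size_t    u64s;     /* current capacity, in u64s */
        uint64_t  onstack[INLINE_U64s];
    };

    static void key_buf_init(struct key_buf *buf)
    {
        memset(buf->onstack, 0, sizeof(buf->onstack));
        buf->k    = buf->onstack;
        buf->u64s = INLINE_U64s;
    }

    /* grow the buffer, switching to the heap only when actually needed */
    static void key_buf_realloc(struct key_buf *buf, size_t u64s)
    {
        if (u64s <= buf->u64s)
            return;

        uint64_t *n = malloc(u64s * sizeof(uint64_t));

        memcpy(n, buf->k, buf->u64s * sizeof(uint64_t));
        if (buf->k != buf->onstack)
            free(buf->k);
        buf->k    = n;
        buf->u64s = u64s;
    }

    static void key_buf_exit(struct key_buf *buf)
    {
        if (buf->k != buf->onstack)
            free(buf->k);
    }

    int main(void)
    {
        struct key_buf buf;

        key_buf_init(&buf);
        key_buf_realloc(&buf, 32);      /* e.g. a big inline data extent */
        buf.k[31] = 42;
        printf("capacity: %zu u64s\n", buf.u64s);
        key_buf_exit(&buf);
        return 0;
    }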

# 5db43418 03-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't issue btree writes that weren't journalled

If we have an error in the btree interior update path that prevents us
from journalling the update, we can't issue the corresponding btree node
write - we didn't get a journal sequence number that would cause it to
be ignored in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3eb26d01 01-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_get_iter() no longer returns errors

Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.

This patch also simplifies the iterator allocation code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d0022290 29-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix error in filesystem initialization

The rhashtable code doesn't like it when we destroy an rhashtable that
was never initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d5425a3b 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Throttle updates when btree key cache is too dirty

This is needed to ensure we don't deadlock because journal reclaim and
thus memory reclaim isn't making forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 12590720 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree key cache shrinker

The shrinker should start scanning for entries that can be freed oldest
to newest - this way, we can avoid scanning a lot of entries that are
too new to be freed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 628a3ad2 12-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a shrinker for the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 876c7af3 15-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Take a SRCU lock in btree transactions

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6a747c46 09-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add accounting for dirty btree nodes/keys

This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4s of the btree node cache and btree key cache are dirty
- ensuring the shrinkers can free memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ae1ede58 02-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't embed btree iters in btree_trans

These haven't been used since reallocating iterators was disabled, and
getting rid of them saves us a lot of stack.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 29364f34 02-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Drop sysfs interface to debug parameters

It's not used much anymore; the module parameter interface is better.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# dcf141b9 28-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix spurious transaction restarts

The check for whether locking a btree node would deadlock was wrong - we
have to check that interior nodes are locked before descendants, but this
check was wrong when considering cached vs. non-cached iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 39283c71 19-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix for bad stripe pointers

The allocator usually doesn't increment bucket gens right away on
buckets that it's about to hand out (for reasons that need to be
documented), instead deferring that to whatever extent update first
references that bucket.

But stripe pointers reference buckets without changing bucket sector
counts, meaning we could end up with a pointer in a stripe with a gen
newer than the bucket it points to.

Fix this by adding a transactional trigger for KEY_TYPE_stripe that just
writes out the keys in the alloc btree for the buckets it points to.

Also - consolidate the code that checks pointer validity.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f3721e12 16-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Perf improvements for bch_alloc_read()

On large filesystems reading in the alloc info takes a significant
amount of time. But we don't need to be calling into the fully general
bch2_mark_key() path, just open code what we need in
bch2_alloc_read_fn().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4580baec 25-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Remove some uses of PAGE_SIZE in the btree code

For portability to userspace, we should try to avoid working in kernel
pages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 697e45b2 06-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_TRIGGER_NOOVERWRITES

This is prep work for reworking the triggers machinery - we have
triggers that need to know both the old and the new key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# fff899b1 03-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Mark btree nodes as needing rewrite when not all replicas are RW

This fixes a bug where recovery fails when one of the devices is read
only.

Also - consolidate the "must rewrite this node to insert it" behind a
new btree node flag.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 64f2a880 28-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix bch2_extent_can_insert() not being called

It's supposed to check whether we're splitting a compressed extent and
if so get a bigger disk reservation - hence this fixes a "disk usage
increased by x without a reservation" bug.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e27b03b3 15-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Give bkey_cached_key same attributes as bpos

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2ca88e5a 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
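
A toy userspace model of the write-cache behaviour described above (a
direct-mapped table with hypothetical names, ignoring collisions, locking
and the journal - nothing like the real rhashtable-based implementation):
updates modify the cached copy and mark it dirty, and a later flush pass
writes dirty entries back to the btree.

    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_SLOTS 16

    struct cached_key {
        bool     valid;
        bool     dirty;
        unsigned key;
        unsigned val;
    };

    static struct cached_key cache[CACHE_SLOTS];
    static unsigned btree[CACHE_SLOTS];         /* stand-in for the real btree */

    static struct cached_key *cache_lookup(unsigned key)
    {
        struct cached_key *ck = &cache[key % CACHE_SLOTS];

        if (!ck->valid) {                       /* fill from the btree on miss */
            ck->valid = true;
            ck->dirty = false;
            ck->key   = key;
            ck->val   = btree[key % CACHE_SLOTS];
        }
        return ck;
    }

    /* update path: modify the cached copy only, mark it dirty */
    static void cache_update(unsigned key, unsigned val)
    {
        struct cached_key *ck = cache_lookup(key);

        ck->val   = val;
        ck->dirty = true;
    }

    /* flush path - in the real code this is driven by journal reclaim */
    static void cache_flush(void)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (cache[i].valid && cache[i].dirty) {
                btree[cache[i].key % CACHE_SLOTS] = cache[i].val;
                cache[i].dirty = false;
            }
    }

    int main(void)
    {
        cache_update(3, 7);
        printf("before flush: btree=%u\n", btree[3]);
        cache_flush();
        printf("after flush:  btree=%u\n", btree[3]);
        return 0;
    }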

# 515282ac 12-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock

__bch2_btree_node_lock() was incorrectly using iter->pos as a proxy for
btree node lock ordering, this caused an off by one error that was
triggered by bch2_btree_node_get_sibling() getting the previous node.

This refactors the code to compare against btree node keys directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f96c0df4 02-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock in bch2_btree_node_get_sibling()

There was a bad interaction with bch2_btree_iter_set_pos_same_leaf(),
which can leave a btree node locked that is just outside iter->pos,
breaking the lock ordering checks in __bch2_btree_node_lock(). Ideally
we should get rid of this corner case, but for now fix it locally with
verbose comments.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 495aabed 02-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add debug code to print btree transactions

Intended to help debug deadlocks, since we can't use lockdep to check
btree node lock ordering.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 00b8ccf7 25-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Interior btree updates are now fully transactional

We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 96e2aa1b 25-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mechanism for passing extra journal entries to bch2_trans_commit()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0329b150 01-Apr-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Trace where btree iterators are allocated

This will help with iterator overflow bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2c31e657 29-Mar-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce max nr of btree iters when lockdep is on

This is so we don't overflow MAX_LOCK_DEPTH.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6357d607 08-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal updates to interior nodes

Previously, the btree has always been self contained and internally
consistent on disk without anything from the journal - the journal just
contained pointers to the btree roots.

However, this meant that btree node split or compact operations - i.e.
anything that changes btree node topology and involves updates to
interior nodes - would require that interior btree node to be written
immediately, which means emitting a btree node write that's mostly empty
(using 4k of space on disk if the filesystem blocksize is 4k to only
write perhaps ~100 bytes of new keys).

More importantly, this meant most btree node writes had to be FUA, and
consumer drives have a history of slow and/or buggy FUA support - other
filesystems have been bitten by this.

This patch changes the interior btree update path to journal updates to
interior nodes, after the writes for the new btree nodes have completed.
Best of all, it turns out to simplify the interior node update path
somewhat.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e62d65f2 15-Mar-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans_commit() path can now insert to interior nodes

This will be needed for the upcoming patches to journal updates to
interior btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e3e464ac 30-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Move extent overwrite handling out of core btree code

Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splitting existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.

This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.

This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.

This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2dac0eae 18-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Iterator debug code improvements

More aggressively checking iterator invariants, and fixing the resulting
bugs. Also greatly simplifying iter_next() and iter_next_slot() - they
were hyper optimized before, but the optimizations were getting too
brittle.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 237e8048 18-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: introduce b->hash_val

This is partly prep work for introducing bch_btree_ptr_v2, but it'll
also be a bit of a performance boost by moving the full key out of the
hot part of struct btree.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d98a5e39 09-Jan-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add some comments for btree iterator flags

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 24326cd1 31-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Sort & deduplicate updates in bch2_trans_update()

Previously, when doing multiple updates in the same transaction commit
that overwrote each other, we relied on doing the updates in the same
order as the bch2_trans_update() calls in order to get the correct
result. But that wasn't correct for triggers; bch2_trans_mark_update()
when marking overwrites would do the wrong thing because it hadn't seen
the update that was being overwritten.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
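
A small userspace sketch of the sort-and-deduplicate step (toy structures,
not btree_insert_entry): updates are sorted by key position, and when two
updates target the same position only the most recently queued one is kept.

    #include <stdio.h>
    #include <stdlib.h>

    struct toy_update {
        unsigned pos;   /* key position */
        unsigned seq;   /* order the update was queued in */
        int      val;
    };

    static int cmp_update(const void *_l, const void *_r)
    {
        const struct toy_update *l = _l, *r = _r;

        if (l->pos != r->pos)
            return l->pos < r->pos ? -1 : 1;
        return l->seq < r->seq ? -1 : 1;        /* later updates sort last */
    }

    /* sort by position, then drop all but the last update per position */
    static int sort_and_dedup(struct toy_update *u, int nr)
    {
        int out = 0;

        qsort(u, nr, sizeof(*u), cmp_update);

        for (int i = 0; i < nr; i++) {
            if (out && u[out - 1].pos == u[i].pos)
                u[out - 1] = u[i];              /* keep the newer update */
            else
                u[out++] = u[i];
        }
        return out;
    }

    int main(void)
    {
        struct toy_update u[] = {
            { .pos = 20, .seq = 0, .val = 1 },
            { .pos = 10, .seq = 1, .val = 2 },
            { .pos = 20, .seq = 2, .val = 3 },  /* overwrites the first update */
        };
        int nr = sort_and_dedup(u, 3);

        for (int i = 0; i < nr; i++)
            printf("pos %u -> val %d\n", u[i].pos, u[i].val);
        return 0;
    }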

# 2d594dfb 31-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out btree_trigger_flags

The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 54e86b58 30-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make btree_insert_entry more private to update path

This should be private to btree_update_leaf.c, and we might end up
removing it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# bcd6f3e0 26-Nov-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use KEY_TYPE_deleted whitouts for extents

Previously, partial overwrites of existing extents were handled
implicitly by the btree code; when reading in a btree node, we'd do a
mergesort of the different bsets and detect and fix partially
overlapping extents during that mergesort.

That approach won't work with snapshots: this changes extents to work
like regular keys as far as the btree code is concerned, where a 0 size
KEY_TYPE_deleted whiteout will completely overwrite an existing extent.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8b3bbe2c 24-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't reexecute triggers when retrying transaction commit

This was causing a bug with transaction iterators overflowing; now, if
triggers have to be reexecuted we always return -EINTR and retry from
the start of the transaction.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c297a763 13-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Refactor whiteouts compaction

The whiteout compaction path - as opposed to just dropping whiteouts -
is now only needed for extents, and soon will only be needed for extent
btree nodes in the old format.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c9bebae6 29-Nov-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Whiteout changes

More prep work for snapshots: extents will soon be using
KEY_TYPE_deleted for whiteouts, with 0 size. But we won't be able to
keep these whiteouts with the rest of the extents in the btree node, due
to sorting invariants breaking.

We can deal with this by immediately moving the new whiteouts to the
unwritten whiteouts area - this just means those whiteouts won't be
sorted, so we need new code to sort them prior to merging them with the
rest of the keys to be written.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
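
The sort-then-merge step is the standard one; here's a compact userspace
sketch using bare key positions (purely illustrative - the real code
operates on packed bkeys): the unwritten whiteouts get sorted just before
the write, then merged with the already-sorted keys being written out.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_pos(const void *l, const void *r)
    {
        unsigned a = *(const unsigned *) l, b = *(const unsigned *) r;

        return a < b ? -1 : a > b;
    }

    /* merge two sorted arrays of key positions into @out, return nr written */
    static int merge_sorted(const unsigned *a, int nr_a,
                            const unsigned *b, int nr_b, unsigned *out)
    {
        int i = 0, j = 0, n = 0;

        while (i < nr_a && j < nr_b)
            out[n++] = a[i] <= b[j] ? a[i++] : b[j++];
        while (i < nr_a)
            out[n++] = a[i++];
        while (j < nr_b)
            out[n++] = b[j++];
        return n;
    }

    int main(void)
    {
        unsigned keys[]      = { 1, 4, 9 };     /* already sorted */
        unsigned whiteouts[] = { 7, 2 };        /* unsorted until write time */
        unsigned out[5];

        qsort(whiteouts, 2, sizeof(unsigned), cmp_pos);
        int n = merge_sorted(keys, 3, whiteouts, 2, out);

        for (int i = 0; i < n; i++)
            printf("%u ", out[i]);
        printf("\n");
        return 0;
    }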

# 2a9101a9 19-Oct-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Refactor bch2_trans_commit() path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8f1965391 19-Oct-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make btree_node_type_needs_gc() cheaper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 64bc0011 26-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rework btree iterator lifetimes

The btree_trans struct needs to memoize/cache btree iterators, so that
on transaction restart we don't have to completely redo btree lookups,
and so that we can do them all at once in the correct order when the
transaction had to restart to avoid a deadlock.

This switches the btree iterator lookups to work based on iterator
position, instead of trying to match them up based on the stack trace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# a7199432 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill deferred btree updates

Will be replaced by cached btree iterators

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# bbd8d203 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_SLOTS isn't a type of btree iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 36e9d698 07-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Do updates in order they were queued up in

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 76426098 16-Aug-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reflink

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 88767d65 24-Jun-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Update path now handles triggers that generate more triggers

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3838be78 15-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't use a fixed size buffer for fs_usage_deltas

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 61011ea2 14-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rip out old hacky transaction restart tracing

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 87c3beb4 15-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a bug with spinning on the journal

Transactional triggers meant that when we failed to get a journal
reservation, then bailed out into the error path to block on a journal
reservation, the second blocking call into the journal code was asking
for less space, which is not what we want.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 60755344 10-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill BTREE_ITER_NOUNLOCK

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 932aa837 11-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_mark_update()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c43a6ef9 05-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_bkey_cached_common

This is prep work for the btree key cache: btree iterators will point to
either struct btree, or a new struct bkey_cached.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ba5c6557 22-Apr-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add actual tracepoints for transaction restarts

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ece254b2 04-Apr-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: don't lose errors from iterators that have been freed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e1120a4c 27-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add iter->idx

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ecc892e4 27-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill btree_iter->next

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 9e5e5b9e 25-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree iterators now always have a btree_trans

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5df4be3f 25-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree iter improvements

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# dc3b63dc 21-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add time stats for btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 9623ab27 15-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree update path cleanup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0dc17247 13-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill struct btree_insert

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# c93cead0 16-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Always use bch2_extent_trim_atomic()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# a8e00bd4 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: increase BTREE_ITER_MAX

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3e5d6c59 19-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use journal preres for deferred btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8fe826f9 13-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert bucket invalidation to key marking path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 39fbc5a4 11-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: gc lock no longer needed for disk reservations

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 42b72e0b 24-Jan-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: journal_replay_early()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3636ed48 17-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Deferred btree updates

Will be used in the future for inode updates, which will be very helpful
for multithreaded workloads that have to update the inode with every
extent update (appends, or updates that change i_sectors)

Also will be used eventually for fully persistent alloc info

However - we still need a mechanism for reserving space in the journal
prior to getting a journal reservation, so it's not technically safe to
make use of this just yet - we could deadlock with the journal full
(although not likely to be an issue in practice)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f0cfb963 29-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Track nr_inodes with the key marking machinery

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 26609b61 01-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make bkey types globally unique

this lets us get rid of a lot of extra switch statements - in a lot of
places we dispatch on the btree node type, and then the key type, so
this is a nice cleanup across a lot of code.

Also improve the on disk format versioning stuff.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ad7ae8d6 23-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree locking fix, refactoring

Hit an assertion, probably spurious, indicating an iterator was unlocked
when it shouldn't have been (spurious because it wasn't locked at all
when the caller called btree_insert_at()).

Add a flag, BTREE_ITER_NOUNLOCK, and tighten up the assertions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1d25849c 07-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Centralize marking of replicas in btree update path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2252aa27 21-Oct-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree gc refactoring

prep work for erasure coding

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ef337c54 06-Oct-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Allocation code refactoring

bch2_alloc_sectors_start() was a nightmare to work with - it's got some
tricky stuff to do, since it wants to use the buckets the writepoint
already has, unless they're not in the target it wants to write to,
unless it can't allocate from any other devices in which case it will
use those buckets if it has to - et cetera.

This restructures the code to start with a new empty list of open
buckets we're going to use for the new allocation, pulling buckets from
the write point's list as we decide that we really are going to use
them - making the code somewhat more functional and drastically easier
to understand.

Also fixes a bug where we could end up waiting on c->freelist_wait
(because allocating from one device failed) but return success from
bch2_bucket_alloc(), because allocating from a different device
succeeded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# fc3268c1 08-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill extent_insert_hook

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# cc1add4a 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_INSERT_JOURNAL_RES_FULL is no longer possible

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e4ccb251 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: make struct btree_iter a bit smaller

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 271a3d3a 21-Jul-2016 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: lift ordering restriction on 0 size extents

This lifts the restriction that 0 size extents must not overlap with
other extents, which means we can now sort extents and non extents the
same way, and will let us simplify a bunch of other stuff as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1fe08f31 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bkey_written()

also cleanups of btree node offsets

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 617391ba 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: improved rw_aux_tree_bsearch()

shouldn't be any reason for an actual binary search here

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b0004d8d 03-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Factor out btree_key_can_insert()

working on getting rid of all the reasons bch2_insert_fixup_extent() can
fail/stop partway, which is needed for other refactorings.

One of the reasons we could have to bail out is if we're splitting a
compressed extent we might need to add to our disk reservation - but we
can check that before actually starting the insert.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1c7a0adf 12-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trace transaction restarts

exceptionally crappy "tracing", but it's a start at documenting the
places restarts can be triggered

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1c6fdbd8 17-Mar-2017 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Initial commit

Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.

Website: https://bcachefs.org

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocation to initialize it anyway, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e8d2fe3b 21-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Consolidate btree id properties

This refactoring centralizes defining per-btree properties.

bch2_key_types_allowed was also about to overflow a u32, so expand that
to a u64.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# eabb10dc 19-Jul-2023 Brian Foster <bfoster@redhat.com>

bcachefs: support btree updates of prejournaled keys

Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.

Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.

Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 73bd774d 06-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted sparse fixes

- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 95b595a5 30-Apr-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree iterator, update flags no longer conflict

Change btree_update_flags to start after the last btree iterator flag,
so that we can pass both in the same flags argument.

This is needed for the upcoming bch2_bkey_get_mut() helper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f3a65bb9 27-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert constants to consts

Rust bindgen doesn't handle macros, but it does handle integer
constants: this conversion aids in implementing safe Rust wrapper
interfaces.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5e2d8be8 18-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Split trans->last_begin_ip and trans->last_restarted_ip

These are two different things - this improves our debug assert
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 930c0c4c 11-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add missing include

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 920e69bc 03-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree write buffer

This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.

This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.

This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.

Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.

- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path.

- Transaction commit path changes:
The update to the btree write buffer both mutates global state and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction at once, and if that fails we have to revert
the filesystem usage counter changes.

If there isn't room we flush the write buffer in the transaction
commit error path and retry.

- A new persistent option, for specifying the number of entries in the
write buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
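
As a rough userspace sketch of the flush-and-retry flow (fixed size array,
made-up names; the real write buffer also handles journalling, sorting and
concurrency): appends fail when the buffer is full, and the caller flushes
and retries, mirroring the flush in the transaction commit error path.

    #include <errno.h>
    #include <stdio.h>

    #define WB_SIZE 4

    struct wb_entry { unsigned btree, key, val; };

    static struct wb_entry wb[WB_SIZE];
    static int wb_nr;

    static int wb_add(unsigned btree, unsigned key, unsigned val)
    {
        if (wb_nr == WB_SIZE)
            return -ENOSPC;             /* caller must flush and retry */

        wb[wb_nr++] = (struct wb_entry) { btree, key, val };
        return 0;
    }

    static void wb_flush(void)
    {
        /* the real flush sorts entries and applies them to the btree */
        for (int i = 0; i < wb_nr; i++)
            printf("flush: btree %u key %u -> %u\n",
                   wb[i].btree, wb[i].key, wb[i].val);
        wb_nr = 0;
    }

    int main(void)
    {
        for (unsigned i = 0; i < 6; i++)
            if (wb_add(0, i, i * 10)) {
                wb_flush();
                wb_add(0, i, i * 10);
            }
        wb_flush();
        return 0;
    }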


# 30ca6ece 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill trans->flags

Recursive transaction commits are occasionally necessary - in
particular, for the upcoming btree write buffer's flush path.

This avoids bugs due to trans->flags being accidentally mutated
mid-commit, which can cause c->writes refcount leaks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 60b55388 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans->notrace_relock_fail

When we unlock in order to submit IO, the next relock event is likely to
fail if submit_bio() blocked - we shouldn't count those events in our
_fail stats, since they're expected events and shouldn't cause test
failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 94c69faf 04-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Use six_lock_ip()

This uses the new _ip() interface to six locks and hooks it up to
btree_path->ip_allocated, when available.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 464b4155 05-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_trans_reset_updates()

This should have been resetting trans->fs_usage_deltas as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ee2c6ea7 08-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_iter->ip_allocated

In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5bbe3f2d 14-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Log more messages in the journal

This patch

- Adds a mechanism for queuing up journal entries prior to the journal
being started, which will be used for early journal log messages

- Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
take format strings. bch2_fs_log_msg() can be used before or after
the journal has been started, and will use the appropriate mechanism.

- Deletes the now obsolete bch2_journal_log_msg()

- And adds more log messages to the recovery path - messages for
journal/filesystem started, journal entries being blacklisted, and
journal replay starting/finishing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e242b92a 15-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix for long running btree transactions & key cache

While a btree transaction is running, we hold a SRCU read lock on the
btree key cache that prevents btree key cache keys from being freed -
this is so that relock() operations won't access freed memory.

The downside of this is that long running btree transactions prevent
memory from being freed from the key cache. This adds a check in
bch2_trans_begin() - if the transaction has been running longer than 1
second, drop and retake the SRCU read lock and zero out pointers to
unlock key cache paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a16b19cd 02-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Allow for more btrees

Expand some bitfields so we can keep adding more btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ac9fa4bd 07-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill btree_insert_ret enum

Replace with standard bcachefs-private error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1617d56d 22-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Key cache now works for snapshots btrees

This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.

We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.

This will allow us to re-enable the key cache for inodes in the next
patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 087e53c2 20-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Bring back BTREE_ITER_CACHED_NOFILL

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c96f108b 24-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Optimize bch2_trans_iter_init()

When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 42af0ad5 17-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a race with b->write_type

b->write_type needs to be set atomically with setting the
btree_node_need_write flag, so move it into b->flags.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 46fee692 28-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improved btree write statistics

This replaces sysfs btree_avg_write_size with btree_write_stats, which
now breaks out statistics by the source of the btree write.

Btree writes that are too small are a source of inefficiency, and
excessive btree resort overhead - this will let us see what's causing
them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fd0c7679 22-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert to __packed and __aligned

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 77671e8f 21-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Move bkey bkey_unpack_key() to bkey.h

Long ago, bkey_unpack_key() was added to bset.h instead of bkey.h
because bkey.h didn't include btree_types.h, which it needs for the
compiled unpack function.

This patch finally moves it to the proper location.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0d7009d7 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete old deadlock avoidance code

This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 33bd5d06 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Deadlock cycle detector

We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occurred. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totalling 20% of actual transaction commits.

- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.

- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
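
The core of the cycle detector can be illustrated with a toy userspace
wait-for graph (heavily simplified: each transaction here waits on at most
one other, whereas the real detector does a DFS over all held locks and
their waiters): before sleeping on a lock we walk the chain of waiters
starting from ourselves, and finding ourselves again means a deadlock, so
the transaction restarts instead of blocking.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_TRANS 4

    /* waits_on[i] == -1 means transaction i isn't blocked on anyone */
    static const int waits_on[NR_TRANS] = { 1, 2, 0, -1 };

    static bool would_deadlock(int trans)
    {
        int t = waits_on[trans];

        while (t != -1) {
            if (t == trans)
                return true;            /* the chain leads back to us */
            t = waits_on[t];
        }
        return false;
    }

    int main(void)
    {
        for (int i = 0; i < NR_TRANS; i++)
            printf("trans %d: %s\n", i,
                   would_deadlock(i) ? "restart (cycle)" : "safe to block");
        return 0;
    }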


# 3d21d48e 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix usage of six lock's percpu mode, key cache version

Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.

Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segregate percpu and non percpu objects, including when
on freelists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4e6defd1 31-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_bkey_cached_common->cached

Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 616928c3 22-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Track maximum transaction memory

This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in

This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2e27f656 21-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill nodes_intent_locked

Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
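
A small sketch of the representation change (generic helpers, not the actual
bcachefs accessors): two bits of lock state per btree level packed into a
single integer, which leaves room for a third state - write locked - alongside
read and intent.

/* Sketch: two bits of lock state per btree level packed into one u64. */
#include <assert.h>
#include <stdint.h>

enum lock_type {
	LOCK_UNLOCKED	= 0,
	LOCK_READ	= 1,
	LOCK_INTENT	= 2,
	LOCK_WRITE	= 3,	/* representable now that we have two bits */
};

static enum lock_type node_locked_type(uint64_t locks, unsigned level)
{
	return (locks >> (level * 2)) & 3;
}

static uint64_t mark_locked(uint64_t locks, unsigned level, enum lock_type type)
{
	locks &= ~(3ULL << (level * 2));		/* clear the old state */
	return locks | ((uint64_t) type << (level * 2));
}

int main(void)
{
	uint64_t locks = 0;

	locks = mark_locked(locks, 0, LOCK_INTENT);
	locks = mark_locked(locks, 1, LOCK_READ);

	assert(node_locked_type(locks, 0) == LOCK_INTENT);
	assert(node_locked_type(locks, 1) == LOCK_READ);
	assert(node_locked_type(locks, 2) == LOCK_UNLOCKED);
	return 0;
}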


# cd5afabe 19-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_locking.c

Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5c0bb66a 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Track the maximum btree_paths ever allocated by each transaction

We need a way to check if the machinery for handling btree_paths within
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.

This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4aba7d45 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rename lock_held_stats -> btree_transaction_stats

Going to be adding more things to this in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6fae65c1 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_CACHED_(NOFILL|NOCREATE)

These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 315c9ba6 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NO_NODE -> BCH_ERR codes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 86b74451 05-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix bch2_btree_trans_to_text()

bch2_btree_trans_to_text() is used to print btree_transactions owned by
other threads; thus, it needs to be particularly careful. This fixes a
null ptr deref caused by racing with the owning thread changing
path->l[].b.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 549d173c 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: EINTR -> BCH_ERR_transaction_restart

Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# e941ae7d 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a counter for btree_trans restarts

This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.

This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 89333156 16-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert bch2_dev_usrdata_drop() to for_each_btree_key2()

The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c807ca95 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: added lock held time stats

We now record the length of time btree locks are held and expose this in debugfs.

Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 43de721a 13-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Unlock in bch2_trans_begin() if we've held locks more than 10us

We try to ensure we never hold btree locks for too long - bcachefs tries
to be soft realtime. This adds a check when restarting a transaction,
where a transaction restart is cheap - if we've been holding locks for
too long, drop and retake them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
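
The check itself is tiny. A userspace sketch of its shape (hypothetical names;
the 10us threshold comes from the commit title): at restart time, if locks have
been held longer than the threshold, drop them and take them again.

#define _POSIX_C_SOURCE 200809L
/* Sketch: drop and retake locks at restart time if held longer than 10us. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_LOCK_HOLD_NS	(10 * 1000)	/* 10 microseconds */

struct trans {
	bool		locked;
	struct timespec	locked_since;
};

static long long ns_since(const struct timespec *then)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - then->tv_sec) * 1000000000LL +
	       (now.tv_nsec - then->tv_nsec);
}

static void trans_begin(struct trans *t)
{
	if (t->locked && ns_since(&t->locked_since) > MAX_LOCK_HOLD_NS) {
		t->locked = false;	/* held too long: let others in */
		printf("dropping locks before retraversing\n");
	}

	if (!t->locked) {	/* (re)take locks with a fresh timestamp */
		clock_gettime(CLOCK_MONOTONIC, &t->locked_since);
		t->locked = true;
	}
}

int main(void)
{
	struct trans t = { 0 };

	trans_begin(&t);				/* takes locks */
	nanosleep(&(struct timespec){ .tv_nsec = 50000 }, NULL);
	trans_begin(&t);				/* drops, then retakes */
	return 0;
}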


# 8f7f566f 16-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache pcpu freedlist

Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.

But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.

This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 50b13bee 17-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve an error message

When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b7c11046 19-May-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Increase max size for btree_trans bump allocator

With backpointers, alloc keys have gotten bigger, so we're needing more
memory here.

We're probably going to need to go with something more sophisticated
than a bump allocator, but - let's see if we can avoid doing that just
yet.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 30525f68 21-May-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix journal_keys_search() overhead

Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.

This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
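
A simplified sketch of the caching strategy, with a sorted array of ints
standing in for the journal keys (the names and the small look-ahead window are
made up): remember the index of the last result, walk forward from it for small
moves, and only fall back to a full binary search when the position moves
backwards or jumps far ahead.

/* Sketch: cache the last search position to avoid repeated binary searches. */
#include <stddef.h>
#include <stdio.h>

struct keys {
	const int	*v;
	size_t		nr;
	size_t		cached_idx;	/* index returned by the last search */
};

/* full binary search: first index whose key is >= @search */
static size_t search_full(const struct keys *k, int search)
{
	size_t l = 0, r = k->nr;

	while (l < r) {
		size_t m = l + (r - l) / 2;

		if (k->v[m] < search)
			l = m + 1;
		else
			r = m;
	}
	return l;
}

static size_t search_cached(struct keys *k, int search)
{
	size_t idx = k->cached_idx;

	/* moving backwards, or jumping far ahead: do the full search */
	if ((idx && k->v[idx - 1] >= search) ||
	    (idx + 8 < k->nr && k->v[idx + 8] < search))
		idx = search_full(k, search);
	else
		while (idx < k->nr && k->v[idx] < search)
			idx++;		/* small move: cheap forward walk */

	k->cached_idx = idx;
	return idx;
}

int main(void)
{
	const int v[] = { 1, 3, 5, 7, 9, 11, 13 };
	struct keys k = { .v = v, .nr = 7 };

	printf("%zu\n", search_cached(&k, 4));	/* 2: forward walk from 0 */
	printf("%zu\n", search_cached(&k, 6));	/* 3: forward walk from 2 */
	printf("%zu\n", search_cached(&k, 2));	/* 1: backwards, full search */
	return 0;
}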


# b0babf2a 12-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_btree_iter_peek_all_levels()

This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.

- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS

- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set, bch2_btree_iter_advanced() just marks
the iterator as advanced.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e1b8f5f5 31-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Plumb btree_id & level to trans_mark

For backpointers, we'll need the full key location - that means btree_id
and btree level. This patch plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c6b2826c 11-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Freespace, need_discard btrees

This adds two new btrees for the upcoming allocator rewrite: an extents
btree of free buckets, and a btree for buckets awaiting discards.

We also add a new trigger for alloc keys to keep the new btrees up to
date, and a compatibility path to initialize them on existing
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3d48a7f8 31-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: KEY_TYPE_alloc_v4

This introduces a new alloc key which doesn't use varints. Soon we'll be
adding backpointers and storing them in alloc keys, which means our
pack/unpack workflow for alloc keys won't really work - we'll need to be
mutating alloc keys in place.

Instead of bch2_alloc_unpack(), we now have bch2_alloc_to_v4() that
converts older types of alloc keys to v4 if needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2a6870ad 29-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use darray for extra_journal_entries

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 3a306f3c 17-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix large key cache keys

Previously, we'd go into an infinite loop when attempting to cache a
bkey in the key cache larger than 128 u64s - since we were only using a
u8 for the size field, it'd get rounded up to 256 then truncated to 0.
Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
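
The arithmetic is easy to reproduce in isolation (a generic demonstration, not
the actual struct layout): rounding a 129-u64 key up to the next power of two
gives 256, which doesn't fit in a u8 and silently becomes 0.

/* Demonstration of the u8 size-field truncation described above. */
#include <stdint.h>
#include <stdio.h>

/* round up to the next power of two (small values only) */
static unsigned roundup_pow_of_two(unsigned v)
{
	unsigned r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

int main(void)
{
	unsigned key_u64s  = 129;			/* just over 128 u64s */
	unsigned allocated = roundup_pow_of_two(key_u64s);

	uint8_t  as_u8	= allocated;	/* u8 field: truncates to 0 */
	uint16_t as_u16	= allocated;	/* a wider field would not */

	printf("allocated=%u u8=%u u16=%u\n", allocated, as_u8, as_u16);
	/* prints: allocated=256 u8=0 u16=256 */
	return 0;
}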


# 30985537 04-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix usage of six lock's percpu mode

Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segregating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# bf3efff5 27-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix race leading to btree node write getting stuck

Checking btree_node_may_write() isn't atomic with the other btree flags,
dirty and need_write in particular. There was a rare race where we'd
unblock a node from writing while __btree_node_flush() was setting
need_write, and no thread would notice that the node was now both able
to write and needed to be written.

Fix this by adding btree node flags for will_make_reachable and
write_blocked that can be checked in the cmpxchg loop in
__bch2_btree_node_write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# de517c95 26-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use x-macros for btree node flags

This is for adding an array of strings for btree node flag names.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 8322a937 04-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache optimization

This helps with lock contention in the journalling code: instead of
updating our journal pin on every write, only get a journal pin if we
don't have one.

This means we can avoid hammering on journal locks nearly so much, at
the cost of carrying around a journal pin for an older entry than the
one we actually need. To handle that, if needed we update our journal
pin to the correct one when flushed by journal reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8f9ad91a 17-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix failure to allocate btree node in cache

The error code when we fail to allocate a node in the btree node cache
doesn't make it to bch2_btree_path_traverse_all(). Instead, we need to
stash a flag in btree_trans so we know we have to take the cannibalize
lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c7ce2732 15-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Also show when blocked on write locks

This consolidates some of the btree node lock path, so that when we're
blocked taking a write lock on a node it shows up in
bch2_btree_trans_to_text(), along with intent and read locks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 12ce5b7d 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache coherency

- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.

- Except when creating new keys: then the update goes to underlying
btree

For iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.

Otherwise, for consistency, all updates should go to the same place -
the key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f7b6ca23 06-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_KEY_CACHE

This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).

Note that we could still race with another thread creating an item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2e63e180 24-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Stash a copy of key being overwritten in btree_insert_entry

We currently need to call bch2_btree_path_peek_slot() multiple times in
the transaction commit path - and some of those need to be updated to
also check the keys from journal replay, too. Let's consolidate this and
stash the key being overwritten in btree_insert_entry.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 1f2d9192 08-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: iter->update_path

With BTREE_ITER_FILTER_SNAPSHOTS, we have to distinguish between the
path where the key was found, and the path for inserting into the
current snapshot. This adds a new field to struct btree_iter for saving
a path for the current snapshot, and plumbs it through
bch2_trans_update().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 669f87a5 03-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Switch to __func__ for recording where btree_trans was initialized

Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5222a460 25-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_JOURNAL

This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.

This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.

Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fb64f3fd 31-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BCH_JSET_ENTRY_log

Add a journal entry type for logging messages, and add an option to use
it to log the transaction name - this makes for a very handy debugging
tool, as with it we can use the 'bcachefs list_journal' command to see
not only what updates were done, but what was doing them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# f3e1f444 21-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NOPRESERVE

This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# b84d42c3 16-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out CONFIG_BCACHEFS_DEBUG_TRANSACTIONS

This puts the btree_transactions sysfs/debugfs file behind a separate
config option - it's highly useful, but not cheap enough to enable
permanently.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f0c3f88b 26-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Run insert triggers before overwrite triggers

Currently, btree triggers are run in natural key order, which presents a
problem for fallocate in INSERT_RANGE mode: since we're moving existing
extents to higher offsets, the trigger for deleting the old extent runs
before the trigger that adds the new extent, potentially leading to
indirect extents being deleted that shouldn't be when the delete causes
the refcount to hit 0.

This changes the order we run triggers so that for a given btree, we run
all insert triggers before overwrite triggers, nicely sidestepping this
issue.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
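
A toy refcount model of why the ordering matters (illustrative only, not the
bcachefs trigger code): when INSERT_RANGE moves an extent, the old and new keys
reference the same indirect data, so running the overwrite (delete) trigger
first drops the refcount to zero and frees data the insert trigger is about to
re-reference.

/* Toy refcount model of insert-before-overwrite trigger ordering. */
#include <stdbool.h>
#include <stdio.h>

struct indirect_extent {
	int	refcount;
	bool	freed;
};

static void trigger_insert(struct indirect_extent *e)
{
	e->refcount++;
}

static void trigger_overwrite(struct indirect_extent *e)
{
	if (--e->refcount == 0)
		e->freed = true;	/* refcount hit 0: data deleted */
}

int main(void)
{
	/* one existing reference; the move deletes one key and inserts one,
	 * both pointing at the same indirect data */
	struct indirect_extent bad  = { .refcount = 1 };
	struct indirect_extent good = { .refcount = 1 };

	/* natural key order: overwrite trigger first - data freed too early */
	trigger_overwrite(&bad);
	trigger_insert(&bad);

	/* fixed order: all insert triggers before overwrite triggers */
	trigger_insert(&good);
	trigger_overwrite(&good);

	printf("old order freed data: %d, new order freed data: %d\n",
	       bad.freed, good.freed);		/* prints: 1, 0 */
	return 0;
}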


# 3e52c222 29-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add journal_seq to inode & alloc keys

Add fields to inode & alloc keys that record the journal sequence number
when they were most recently modified.

For alloc keys, this is needed to know what journal sequence number we
have to flush before the bucket can be reused. Currently this is tracked
in memory, but we'll be getting rid of the in memory bucket array.

For inodes, this is needed for fsync when the inode has been evicted
from the vfs cache. Currently we use a bloom filter per outstanding
journal buf - but that mechanism has been broken since we added the
ability to not issue a flush/fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 34d74830 30-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: BTREE_UPDATE_NOJOURNAL

We're going to have btree updates that don't need to be journalled; add
a flag for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9a796fdb 19-Oct-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_exit() no longer returns errors

Now that peek_node()/next_node() are converted to return errors
directly, we don't need bch2_trans_exit() to return errors - it's
cleaner this way and wasn't used much anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c075ff70 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_FILTER_SNAPSHOTS

For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 14b393ee 15-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Subvolumes, snapshots

This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 3074bc0f 15-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

Revert "bcachefs: Add more assertions for locking btree iterators out of order"

Figured out the bug we were chasing, and it had nothing to do with
locking btree iterators/paths out of order.

This reverts commit ff08733dd298c969aec7c7828095458f73fd5374.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 068bcaa5 03-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add more assertions for locking btree iterators out of order

btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 67e0dd8f 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_path

This splits btree_iter into two components: btree_iter is now the
externally visible component, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f21566f1 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NODES

We really only need to distinguish between btree iterators and btree key
cache iterators - this is more prep work for btree_path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# deb0e573 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NEED_PEEK

This was used for an optimization that hasn't existed in quite a while
- iter->uptodate will probably be going away as well.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 6fba6b83 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Prefer using btree_insert_entry to btree_iter

This moves some data dependencies forward, to improve pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5f8077cc 29-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_SET_POS_AFTER_COMMIT

BTREE_ITER_SET_POS_AFTER_COMMIT is used internally to automagically
advance extent btree iterators on successful commit.

But with the upcoming btree_path patch it's getting more awkward to
support, and it adds overhead to core data structures that's only used
in a few places, and can be easily done by the caller instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 84841b0d 24-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_dump_trans_iters_updates()

This factors out bch2_dump_trans_iters_updates() from the iter alloc
overflow path, and makes some small improvements to what it prints.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 0423fb71 12-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Keep a sorted list of btree iterators

This will be used to make other operations on btree iterators within a
transaction more efficient, and enable some other improvements to how we
manage btree iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e5af273f 25-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans->restarted

Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 9f1833ca 10-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Update btree ptrs after every write

This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# b00fde8f 05-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

Add a new flag to control assertions about updating to internal snapshot
nodes, that normally should not be written to - to be used in an
upcoming patch.

Also do some renaming - trigger_flags is now update_flags.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 297d8934 10-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Extensive triggers cleanups

- We no longer mark subsets of extents, they're marked like regular
keys now - which means we can drop the offset & sectors arguments
to trigger functions
- Drop other arguments that are no longer needed anymore in various
places - fs_usage
- Drop the logic for handling extents in bch2_mark_update() that isn't
needed anymore, to match bch2_trans_mark_update()
- Better logic for handling the BTREE_ITER_CACHED_NOFILL case, where we
don't have an old key to mark

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# a49e9a05 12-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix null ptr deref when splitting compressed extents

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# cd8319fd 07-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill trans->updates2

Now that extent handling has been lifted to bch2_trans_update(), we
don't need to keep two different lists of updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5288e66a 03-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_UPDATES

This drops bch2_btree_iter_peek_with_updates() and replaces it with a
new flag, BTREE_ITER_WITH_UPDATES, and also reworks
bch2_btree_iter_peek_slot() to respect it too.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 509d3e0a 20-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Child btree iterators

This adds the ability for btree iterators to own child iterators - to be
used by an upcoming rework of bch2_btree_iter_peek_slot(), so we can
scan forwards while maintaining our current position.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 66a0a497 04-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter->should_be_locked

Add a field to struct btree_iter for tracking whether it should be
locked - this fixes spurious transaction restarts in
bch2_trans_relock().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 531a0095 04-Jun-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree iterator tracepoints

This patch adds some new tracepoints to the btree iterator code, and
adds new fields to the existing tracepoints - primarily for the iterator
position.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# d7f35163 09-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix BTREE_ITER_NOT_EXTENTS

bch2_btree_iter_peek() wasn't properly checking for
BTREE_ITER_IS_EXTENTS when updating iter->pos.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3ce8b463 03-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill bset_tree->max_key

Since we now ensure a btree node's max key fits in its packed format,
this isn't needed for the reasons it used to be - and, it was being used
inconsistently.

Also reorder struct btree a bit for performance, and kill some dead
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5c1d808a 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Drop trans->nounlock

Since we're no longer doing btree node merging post commit, we can now
delete a bunch of code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e751c01a 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Start using bpos.snapshot field

This patch starts treating the bpos.snapshot field like part of the key
in the btree code:

* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
and xattrs) now always have their snapshot field set to U32_MAX

The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.

We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and always U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).

This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
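
A sketch of what including the snapshot field in the key order means for
successor computation (simplified struct; the helpers are illustrative rather
than the exact bcachefs definitions): with all-snapshots iteration the snapshot
field is the least significant part of the key, otherwise the successor skips
over it.

/* Sketch: key successor with and without the snapshot field included. */
#include <stdint.h>
#include <stdio.h>

struct bpos {
	uint64_t	inode;
	uint64_t	offset;
	uint32_t	snapshot;
};

/* successor over the full key, snapshot field included */
static struct bpos bpos_successor(struct bpos p)
{
	if (++p.snapshot)
		return p;
	if (++p.offset)
		return p;
	++p.inode;
	return p;
}

/* successor that skips the snapshot field: next (inode, offset) position */
static struct bpos bpos_nosnap_successor(struct bpos p)
{
	p.snapshot = 0;
	if (++p.offset)
		return p;
	++p.inode;
	return p;
}

int main(void)
{
	struct bpos p = { .inode = 1, .offset = 10, .snapshot = 5 };
	struct bpos a = bpos_successor(p);
	struct bpos b = bpos_nosnap_successor(p);

	printf("all snapshots: %llu:%llu:%u\n", (unsigned long long) a.inode,
	       (unsigned long long) a.offset, a.snapshot);	/* 1:10:6 */
	printf("skip snapshot: %llu:%llu:%u\n", (unsigned long long) b.inode,
	       (unsigned long long) b.offset, b.snapshot);	/* 1:11:0 */
	return 0;
}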


# 43d00243 03-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mechanism for running callbacks at trans commit time

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 331194a2 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache locking improvements

The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c7bb769c 19-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: __bch2_trans_get_iter() refactoring, BTREE_ITER_NOT_EXTENTS

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1f7fdc0a 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter_live()

New helper to clean things up a bit - also, improve iter->flags
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6333bd2f 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve handling of extents in bch2_trans_update()

The transaction update/commit path cares about whether it's inserting
extents or regular keys; extents require extra passes (handling of
overlapping extents) but sometimes we want to skip all that. This
clarifies things by adding a new member to btree_insert_entry specifying
whether the key being inserted is an extent, instead of overloading
BTREE_ITER_IS_EXTENTS.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 41f8b09e 20-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rename BTREE_ID enums for consistency with other enums

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f2785955 19-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill support for !BTREE_NODE_NEW_EXTENT_OVERWRITE()

bcachefs has been aggressively migrating filesystems and btree nodes to
the new format for quite some time - this shouldn't affect anyone
anymore, and lets us delete a _lot_ of code. Also, it frees up
KEY_TYPE_discard for a new whiteout key type for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e131b6aa 23-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mempool for btree_trans bump allocator

This allocation is required for filesystem operations to make forward
progress, thus needs a mempool.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1889ad5a 14-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add code to scan for/rewrite old btree nodes

This adds a new data job type to scan for btree nodes in the old extent
format, and rewrite them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7e1a3aa9 11-Feb-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: iter->real_pos

We need to differentiate between the search position of a btree
iterator, vs. what it actually points at (what we found). This matters
for extents, where iter->pos will typically be the start of the key we
found and iter->real_pos will be the end of the key we found (which soon
won't necessarily be in the same btree node!) and it will also matter
for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 07a1006a 17-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce/kill BKEY_PADDED use

With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.

In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5db43418 03-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't issue btree writes that weren't journalled

If we have an error in the btree interior update path that prevents us
from journalling the update, we can't issue the corresponding btree node
write - we didn't get a journal sequence number that would cause it to
be ignored in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3eb26d01 01-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_get_iter() no longer returns errors

Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.

This patch also simplifies the iterator allocation code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d0022290 29-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix error in filesystem initialization

The rhashtable code doesn't like it when we destroy an rhashtable that
was never initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d5425a3b 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Throttle updates when btree key cache is too dirty

This is needed to ensure we don't deadlock because journal reclaim and
thus memory reclaim isn't making forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 12590720 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree key cache shrinker

The shrinker should start scanning for entries that can be freed oldest
to newest - this way, we can avoid scanning a lot of entries that are
too new to be freed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 628a3ad2 12-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a shrinker for the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 876c7af3 15-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Take a SRCU lock in btree transactions

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6a747c46 09-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add accounting for dirty btree nodes/keys

This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4s of the btree node cache and btree key cache are dirty
- ensuring the shrinkers can free memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
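
The 3/4 rule reduces to one comparison. A minimal sketch (assumed names and a
made-up cache size): once dirty entries exceed three quarters of the cache,
journal reclaim is asked to flush, so the shrinkers always have clean entries
left to free.

/* Sketch: trigger journal reclaim when > 3/4 of cached entries are dirty. */
#include <stdbool.h>
#include <stdio.h>

struct cache_accounting {
	unsigned long	nr_keys;	/* total cached entries */
	unsigned long	nr_dirty;	/* entries journal reclaim must flush */
};

static bool needs_reclaim(const struct cache_accounting *acc)
{
	return acc->nr_dirty * 4 > acc->nr_keys * 3;
}

int main(void)
{
	struct cache_accounting acc = { .nr_keys = 1000, .nr_dirty = 700 };

	printf("needs reclaim: %d\n", needs_reclaim(&acc));	/* 0 */

	acc.nr_dirty = 800;
	printf("needs reclaim: %d\n", needs_reclaim(&acc));	/* 1 */
	return 0;
}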


# ae1ede58 02-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't embed btree iters in btree_trans

These haven't been used since reallocing iterators was disabled, and
getting rid of them saves us a lot of stack.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 29364f34 02-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Drop sysfs interface to debug parameters

It's not used much anymore; the module parameter interface is better.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# dcf141b9 28-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix spurious transaction restarts

The check for whether locking a btree node would deadlock was wrong - we
have to check that interior nodes are locked before descendants, but
this check was wrong when considering cached vs. non-cached iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 39283c71 19-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix for bad stripe pointers

The allocator usually doesn't increment bucket gens right away on
buckets that it's about to hand out (for reasons that need to be
documented), instead deferring that to whatever extent update first
references that bucket.

But stripe pointers reference buckets without changing bucket sector
counts, meaning we could end up with a pointer in a stripe with a gen
newer than the bucket it points to.

Fix this by adding a transactional trigger for KEY_TYPE_stripe that just
writes out the keys in the alloc btree for the buckets it points to.

Also - consolidate the code that checks pointer validity.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f3721e12 16-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Perf improvements for bch_alloc_read()

On large filesystems reading in the alloc info takes a significant
amount of time. But we don't need to be calling into the fully general
bch2_mark_key() path, just open code what we need in
bch2_alloc_read_fn().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4580baec 25-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Remove some uses of PAGE_SIZE in the btree code

For portability to userspace, we should try to avoid working in kernel
pages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 697e45b2 06-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_TRIGGER_NOOVERWRITES

This is prep work for reworking the triggers machinery - we have
triggers that need to know both the old and the new key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fff899b1 03-Jul-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Mark btree nodes as needing rewrite when not all replicas are RW

This fixes a bug where recovery fails when one of the devices is read
only.

Also - consolidate the "must rewrite this node to insert it" behind a
new btree node flag.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 64f2a880 28-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix bch2_extent_can_insert() not being called

It's supposed to check whether we're splitting a compressed extent and
if so get a bigger disk reservation - hence this fixes a "disk usage
increased by x without a reservation" bug.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e27b03b3 15-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Give bkey_cached_key same attributes as bpos

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2ca88e5a 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 515282ac 12-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock

__bch2_btree_node_lock() was incorrectly using iter->pos as a proxy for
btree node lock ordering, this caused an off by one error that was
triggered by bch2_btree_node_get_sibling() getting the previous node.

This refactors the code to compare against btree node keys directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f96c0df4 02-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock in bch2_btree_node_get_sibling()

There was a bad interaction with bch2_btree_iter_set_pos_same_leaf(),
which can leave a btree node locked that is just outside iter->pos,
breaking the lock ordering checks in __bch2_btree_node_lock(). Ideally
we should get rid of this corner case, but for now fix it locally with
verbose comments.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 495aabed 02-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add debug code to print btree transactions

Intended to help debug deadlocks, since we can't use lockdep to check
btree node lock ordering.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 00b8ccf7 25-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Interior btree updates are now fully transactional

We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96e2aa1b 25-May-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a mechanism for passing extra journal entries to bch2_trans_commit()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0329b150 01-Apr-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Trace where btree iterators are allocated

This will help with iterator overflow bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2c31e657 29-Mar-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce max nr of btree iters when lockdep is on

This is so we don't overflow MAX_LOCK_DEPTH.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6357d607 08-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Journal updates to interior nodes

Previously, the btree has always been self contained and internally
consistent on disk without anything from the journal - the journal just
contained pointers to the btree roots.

However, this meant that btree node split or compact operations - i.e.
anything that changes btree node topology and involves updates to
interior nodes - would require that interior btree node to be written
immediately, which means emitting a btree node write that's mostly empty
(using 4k of space on disk if the filesystem block size is 4k to only
write perhaps ~100 bytes of new keys).

More importantly, this meant most btree node writes had to be FUA, and
consumer drives have a history of slow and/or buggy FUA support - other
filesystems have been bitten by this.

This patch changes the interior btree update path to journal updates to
interior nodes, after the writes for the new btree nodes have completed.
Best of all, it turns out to simplify the interior node update path
somewhat.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e62d65f2 15-Mar-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans_commit() path can now insert to interior nodes

This will be needed for the upcoming patches to journal updates to
interior btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e3e464ac 30-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Move extent overwrite handling out of core btree code

Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splitting existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.

This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.

This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.

This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2dac0eae 18-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Iterator debug code improvements

More aggressively checking iterator invariants, and fixing the resulting
bugs. Also greatly simplifying iter_next() and iter_next_slot() - they
were hyper optimized before, but the optimizations were getting too
brittle.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 237e8048 18-Feb-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: introduce b->hash_val

This is partly prep work for introducing bch_btree_ptr_v2, but it'll
also be a bit of a performance boost by moving the full key out of the
hot part of struct btree.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d98a5e39 09-Jan-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add some comments for btree iterator flags

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 24326cd1 31-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Sort & deduplicate updates in bch2_trans_update()

Previously, when doing multiple update in the same transaction commit
that overwrote each other, we relied on doing the updates in the same
order as the bch2_trans_update() calls in order to get the correct
result. But that wasn't correct for triggers; bch2_trans_mark_update()
when marking overwrites would do the wrong thing because it hadn't seen
the update that was being overwritten.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2d594dfb 31-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out btree_trigger_flags

The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 54e86b58 30-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make btree_insert_entry more private to update path

This should be private to btree_update_leaf.c, and we might end up
removing it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# bcd6f3e0 26-Nov-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use KEY_TYPE_deleted whitouts for extents

Previously, partial overwrites of existing extents were handled
implicitly by the btree code; when reading in a btree node, we'd do a
mergesort of the different bsets and detect and fix partially
overlapping extents during that mergesort.

That approach won't work with snapshots: this changes extents to work
like regular keys as far as the btree code is concerned, where a 0 size
KEY_TYPE_deleted whiteout will completely overwrite an existing extent.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8b3bbe2c 24-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't reexecute triggers when retrying transaction commit

This was causing a bug with transaction iterators overflowing; now, if
triggers have to be reexecuted we always return -EINTR and retry from
the start of the transaction.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c297a763 13-Dec-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Refactor whiteouts compaction

The whiteout compaction path - as opposed to just dropping whiteouts -
is now only needed for extents, and soon will only be needed for extent
btree nodes in the old format.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c9bebae6 29-Nov-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Whiteout changes

More prep work for snapshots: extents will soon be using
KEY_TYPE_deleted for whiteouts, with 0 size. But we won't be able to
keep these whiteouts with the rest of the extents in the btree node, due
to sorting invariants breaking.

We can deal with this by immediately moving the new whiteouts to the
unwritten whiteouts area - this just means those whiteouts won't be
sorted, so we need new code to sort them prior to merging them with the
rest of the keys to be written.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2a9101a9 19-Oct-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Refactor bch2_trans_commit() path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8f1965391 19-Oct-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make btree_node_type_needs_gc() cheaper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 64bc0011 26-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rework btree iterator lifetimes

The btree_trans struct needs to memoize/cache btree iterators, so that
on transaction restart we don't have to completely redo btree lookups,
and so that we can do them all at once in the correct order when the
transaction had to restart to avoid a deadlock.

This switches the btree iterator lookups to work based on iterator
position, instead of trying to match them up based on the stack trace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a7199432 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill deferred btree updates

Will be replaced by cached btree iterators

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# bbd8d203 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_SLOTS isn't a type of btree iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 36e9d698 07-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Do updates in order they were queued up in

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 76426098 16-Aug-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reflink

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 88767d65 24-Jun-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Update path now handles triggers that generate more triggers

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3838be78 15-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't use a fixed size buffer for fs_usage_deltas

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 61011ea2 14-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rip out old hacky transaction restart tracing

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 87c3beb4 15-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a bug with spinning on the journal

Transactional triggers meant that when we failed to get a journal
reservation, then bailed out into the error path to block on a journal
reservation, the second blocking call into the journal code was asking
for less space, which is not what we want.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 60755344 10-May-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill BTREE_ITER_NOUNLOCK

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 932aa837 11-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_mark_update()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c43a6ef9 05-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_bkey_cached_common

This is prep work for the btree key cache: btree iterators will point to
either struct btree, or a new struct bkey_cached.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ba5c6557 22-Apr-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add actual tracepoints for transaction restarts

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ece254b2 04-Apr-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: don't lose errors from iterators that have been freed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e1120a4c 27-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add iter->idx

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ecc892e4 27-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill btree_iter->next

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9e5e5b9e 25-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree iterators now always have a btree_trans

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5df4be3f 25-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree iter improvements

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# dc3b63dc 21-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add time stats for btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 9623ab27 15-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree update path cleanup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0dc17247 13-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill struct btree_insert

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c93cead0 16-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Always use bch2_extent_trim_atomic()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a8e00bd4 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: increase BTREE_ITER_MAX

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3e5d6c59 19-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use journal preres for deferred btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8fe826f9 13-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert bucket invalidation to key marking path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 39fbc5a4 11-Feb-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: gc lock no longer needed for disk reservations

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 42b72e0b 24-Jan-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: journal_replay_early()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3636ed48 17-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Deferred btree updates

This will be used in the future for inode updates, which will be very
helpful for multithreaded workloads that have to update the inode with
every extent update (appends, or updates that change i_sectors).

It will also eventually be used for fully persistent alloc info.

However - we still need a mechanism for reserving space in the journal
prior to getting a journal reservation, so it's not technically safe to
make use of this just yet: we could deadlock with the journal full
(although that's not likely to be an issue in practice).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f0cfb963 29-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Track nr_inodes with the key marking machinery

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 26609b61 01-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Make bkey types globally unique

This lets us get rid of a lot of extra switch statements - in a lot of
places we dispatch on the btree node type and then on the key type, so
this is a nice cleanup across a lot of code.

Also improves the on-disk format versioning.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
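
A toy illustration, with made-up enum values, of what globally unique
key types buy: instead of a nested dispatch on the btree and then on a
per-btree key type, a single switch on the key type is enough.

  #include <stdio.h>

  /* Key types drawn from one global namespace (values illustrative). */
  enum bch_bkey_type {
          KEY_TYPE_deleted,
          KEY_TYPE_extent,
          KEY_TYPE_inode,
          KEY_TYPE_dirent,
  };

  struct bkey { enum bch_bkey_type type; };

  /*
   * Previously this kind of helper had to switch on the btree first and
   * then on a btree-specific key type; now one switch covers every
   * btree, because a given type value means the same thing everywhere.
   */
  static const char *bkey_type_name(const struct bkey *k)
  {
          switch (k->type) {
          case KEY_TYPE_extent:   return "extent";
          case KEY_TYPE_inode:    return "inode";
          case KEY_TYPE_dirent:   return "dirent";
          default:                return "deleted/unknown";
          }
  }

  int main(void)
  {
          struct bkey k = { .type = KEY_TYPE_inode };

          printf("%s\n", bkey_type_name(&k));
          return 0;
  }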


# ad7ae8d6 23-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree locking fix, refactoring

Hit an assertion, probably spurious, indicating an iterator was unlocked
when it shouldn't have been (spurious because it wasn't locked at all
when the caller called btree_insert_at()).

Add a flag, BTREE_ITER_NOUNLOCK, and tighten up the assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
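
A small sketch of the flag-plus-assertion pattern the commit describes,
with an invented flag value and a much simplified iterator:

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define BTREE_ITER_NOUNLOCK     (1U << 0)   /* illustrative value */

  struct btree_iter {
          uint32_t flags;
          bool     locked;
  };

  static void btree_iter_unlock(struct btree_iter *iter)
  {
          /* The tightened assertion: code paths that marked the
           * iterator NOUNLOCK must never drop its locks. */
          assert(!(iter->flags & BTREE_ITER_NOUNLOCK));
          iter->locked = false;
  }

  int main(void)
  {
          struct btree_iter iter = { .flags = 0, .locked = true };

          btree_iter_unlock(&iter);           /* fine: flag not set */
          printf("locked=%d\n", iter.locked);

          iter.flags |= BTREE_ITER_NOUNLOCK;  /* unlocking now asserts */
          return 0;
  }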


# 1d25849c 07-Nov-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Centralize marking of replicas in btree update path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2252aa27 21-Oct-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree gc refactoring

Prep work for erasure coding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ef337c54 06-Oct-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Allocation code refactoring

bch2_alloc_sectors_start() was a nightmare to work with - it has some
tricky work to do: it wants to use the buckets the write point already
has, unless they're not in the target it wants to write to, unless it
can't allocate from any other devices, in which case it will use those
buckets anyway - et cetera.

This restructures the code to start with a new empty list of open
buckets we're going to use for the new allocation, pulling buckets from
the write point's list as we decide that we really are going to use
them - making the code somewhat more functional and drastically easier
to understand.

Also fixes a bug where we could end up waiting on c->freelist_wait
(because allocating from one device failed) even though
bch2_bucket_alloc() returned success, because allocating from a
different device succeeded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
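
A sketch of the restructured flow, under invented names and with
trivial data structures (the real code deals with open_bucket
refcounts, devices, and the allocator): build the allocation's bucket
list from scratch, moving buckets off the write point's list only once
we've decided to use them.

  #include <stdbool.h>
  #include <stdio.h>

  struct open_bucket { int dev; };

  struct ob_list {
          struct open_bucket *v[8];
          unsigned            nr;
  };

  static void ob_push(struct ob_list *l, struct open_bucket *ob)
  {
          l->v[l->nr++] = ob;
  }

  /* Illustrative policy: is this bucket's device in the target? */
  static bool bucket_in_target(struct open_bucket *ob, int target)
  {
          return target < 0 || ob->dev == target;
  }

  /*
   * Start with an empty list for this allocation and move buckets off
   * the write point only once we've decided to use them; everything
   * else stays on the write point for later writes.
   */
  static void alloc_pick_buckets(struct ob_list *wp, struct ob_list *alloc,
                                 int target, unsigned nr_wanted)
  {
          struct ob_list keep = { .nr = 0 };

          for (unsigned i = 0; i < wp->nr; i++) {
                  struct open_bucket *ob = wp->v[i];

                  if (alloc->nr < nr_wanted && bucket_in_target(ob, target))
                          ob_push(alloc, ob);     /* this write uses it */
                  else
                          ob_push(&keep, ob);     /* leave it behind */
          }

          *wp = keep;
          /* If alloc->nr is still short of nr_wanted, the caller would
           * now go to the allocator for fresh buckets. */
  }

  int main(void)
  {
          struct open_bucket b0 = { .dev = 0 }, b1 = { .dev = 1 };
          struct ob_list wp = { .v = { &b0, &b1 }, .nr = 2 };
          struct ob_list alloc = { .nr = 0 };

          alloc_pick_buckets(&wp, &alloc, 1, 1);
          printf("picked %u bucket(s), %u left on write point\n",
                 alloc.nr, wp.nr);
          return 0;
  }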


# fc3268c1 08-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: kill extent_insert_hook

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cc1add4a 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_INSERT_JOURNAL_RES_FULL is no longer possible

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e4ccb251 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: make struct btree_iter a bit smaller

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 271a3d3a 21-Jul-2016 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: lift ordering restriction on 0 size extents

This lifts the restriction that 0-size extents must not overlap with
other extents, which means we can now sort extents and non-extents the
same way, and will let us simplify a bunch of other code as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1fe08f31 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bkey_written()

Also cleans up btree node offsets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 617391ba 05-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: improved rw_aux_tree_bsearch()

There shouldn't be any reason for an actual binary search here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
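
The commit message doesn't spell out the replacement, so the following
is only a generic sketch of the idea it hints at: when a good starting
guess for the position is already available, a short linear scan can
stand in for a full binary search over a sorted table.

  #include <stddef.h>
  #include <stdio.h>

  /* Returns the index of the first entry >= target, starting the scan
   * from a caller-supplied guess instead of bisecting the whole table. */
  static size_t search_from_guess(const unsigned *tbl, size_t nr,
                                  size_t guess, unsigned target)
  {
          size_t i = guess < nr ? guess : nr;

          while (i > 0 && tbl[i - 1] >= target)   /* overshot: walk left */
                  i--;
          while (i < nr && tbl[i] < target)       /* walk right */
                  i++;

          return i;
  }

  int main(void)
  {
          unsigned tbl[] = { 0, 4, 8, 12, 16, 20 };
          size_t idx = search_from_guess(tbl, 6, 4, 10);

          printf("first entry >= 10 is at index %zu\n", idx);  /* 3 */
          return 0;
  }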


# b0004d8d 03-Aug-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Factor out btree_key_can_insert()

This is part of getting rid of all the reasons bch2_insert_fixup_extent()
can fail/stop partway, which is needed for other refactorings.

One of the reasons we could have to bail out is that if we're splitting
a compressed extent, we might need to add to our disk reservation - but
we can check that before actually starting the insert.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
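
A sketch of the pattern, with invented names and fields: gather the
checks that could force a bail-out (here, whether splitting a
compressed extent needs a bigger disk reservation) into one up-front
helper, so that once the insert starts it can no longer stop partway.

  #include <stdbool.h>
  #include <stdio.h>

  struct insert_ctx {
          bool     splits_compressed_extent; /* would the insert split one? */
          unsigned sectors_needed;           /* extra sectors if it does */
          unsigned disk_res;                 /* sectors already reserved */
  };

  enum { BTREE_INSERT_OK, BTREE_INSERT_NEED_DISK_RES };

  /* Every "could we have to bail out?" check, done before we start. */
  static int btree_key_can_insert(const struct insert_ctx *c)
  {
          if (c->splits_compressed_extent &&
              c->sectors_needed > c->disk_res)
                  return BTREE_INSERT_NEED_DISK_RES;

          return BTREE_INSERT_OK;
  }

  static void do_insert(const struct insert_ctx *c)
  {
          /* Once we get here, nothing can force us to stop partway. */
          printf("inserting with %u sectors reserved\n", c->disk_res);
  }

  int main(void)
  {
          struct insert_ctx c = {
                  .splits_compressed_extent = true,
                  .sectors_needed           = 8,
                  .disk_res                 = 4,
          };

          while (btree_key_can_insert(&c) == BTREE_INSERT_NEED_DISK_RES)
                  c.disk_res += 8;    /* grow the reservation, retry */

          do_insert(&c);
          return 0;
  }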


# 1c7a0adf 12-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trace transaction restarts

Exceptionally crappy "tracing", but it's a start at documenting the
places where restarts can be triggered.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1c6fdbd8 17-Mar-2017 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Initial commit

Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.

Website: https://bcachefs.org

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>