History log of /linux-master/fs/bcachefs/btree_key_cache.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# adfe9357 20-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Tweak btree key cache shrinker so it actually frees

Freeing key cache items is a multi stage process; we need to wait for an
SRCU grace period to elapse, and we handle this ourselves - partially to
avoid callback overhead, but primarily so that when allocating we can
first allocate from the freed items waiting for an SRCU grace period.

Previously, the shrinker was counting the items on the 'waiting for SRCU
grace period' lists as items being scanned, but this meant that too many
items waiting for an SRCU grace period could prevent it from doing any
work at all.

After this, we're seeing that items skipped due to the accessed bit are
the main cause of the shrinker not making any progress, and we actually
want the key cache shrinker to run quite aggressively because reclaimed
items will still generally be found (more compactly) in the btree node
cache - so we also tweak the shrinker to not count those against
nr_to_scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 09e913f5 25-Mar-2024 Hongbo Li <lihongbo22@huawei.com>

bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list

When allocating bkey_cached from bc->freed_pcpu list, it missed
decreasing the count of nr_freed_pcpu which would cause the mismatch
between the value of nr_freed_pcpu and the list items. This problem
also exists in moving new bkey_cached to bc->freed_pcpu list.
If these happened, the bug info may appear in
bch2_fs_btree_key_cache_exit by the follow code:

BUG_ON(list_count_nodes(&bc->freed_pcpu) != bc->nr_freed_pcpu);
BUG_ON(list_count_nodes(&bc->freed_nonpcpu) != bc->nr_freed_nonpcpu);

Fixes: c65c13f0eac6 ("bcachefs: Run btree key cache shrinker less aggressively")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6088234c 05-Apr-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: JOURNAL_SPACE_LOW

"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.

The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottleneked on write buffer flushing - we
don't want kicknig off btree updates to depend on the state of the write
buffer.

This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3ed94062 17-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bch2_fatal_error()

error messages should always include __func__

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b3f8e711 10-Mar-2024 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix btree key cache coherency during replay

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ccb7b08f 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: trans_for_each_path() no longer uses path->idx

path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.

This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7f9821a7 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_insert_entry -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 07f383c7 03-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_iter -> btree_path_idx_t

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0c0ba8e9 19-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: skip journal more often in key cache reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e4e49375 10-Dec-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs; kill bch2_btree_key_cache_flush()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cf5bacb6 26-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: delete useless commit_do()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3c471b65 26-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: convert bch_fs_flags to x-macro

Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b4b79b07 13-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't rejournal keys in key cache flush

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# cb52d23e 11-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Rename BTREE_INSERT flags

BTREE_INSERT flags are actually transaction commit flags - rename them
for clarity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0117591e 30-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't drop journal pins in exit path

There's no need to drop journal pins in our exit paths - the code was
trying to have everything cleaned up on any shutdown, but better to just
tweak the assertions a bit.

This fixes a bug where calling into journal reclaim in the exit path
would cass a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 006ccc30 04-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill journal pre-reservations

This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# c65c13f0 06-Nov-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Run btree key cache shrinker less aggressively

The btree key cache maintains lists of items that have been freed, but
can't yet be reclaimed because a bch2_trans_relock() call might find
them - we're waiting for SRCU readers to release.

Previously, we wouldn't count these items against the number we're
attempting to scan for, which would mean we'd evict more live key cache
entries - doing quite a bit of potentially unecessary work.

With recent work to make sure we don't hold SRCU locks for too long, it
should be safe to count all the items on the freelists against number to
scan - even if we can't reclaim them yet, we will be able to soon.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# be9e782d 27-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't downgrade locks on transaction restart

We should only be downgrading locks on success - otherwise, our
transaction restarts won't be getting the correct locks and we'll
livelock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a1d97d84 19-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix shrinker names

Shrinkers are now exported to debugfs, so the names can't have slashes
in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 88dfe193 19-Oct-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_id_str()

Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ecae0bd5 02-Nov-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:

- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'

- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested

- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:

mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'

- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code

- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'

- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'

- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'

- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'

- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames

- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use

- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code

- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'

- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2

- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'

- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says

- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()

- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'

- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans

- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'

- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU

- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code

- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result

- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions

- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements

- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things

- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'

- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults

- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code

- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'

- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'

- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'

- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'

- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'

- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'

- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'

- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark

- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'

- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'

- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'

- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in

https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

with help from Qi Zheng.

The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...


# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f7ed15eb 12-Sep-2023 Nathan Chancellor <nathan@kernel.org>

bcachefs: Fix -Wformat in bch2_btree_key_cache_to_text()

When building bcachefs for 32-bit ARM, there is a compiler warning in
bch2_btree_key_cache_to_text() due to use of an incorrect format
specifier:

fs/bcachefs/btree_key_cache.c:1060:36: error: format specifies type 'size_t' (aka 'unsigned int') but the argument has type 'long' [-Werror,-Wformat]
1060 | prt_printf(out, "nr_freed:\t%zu", atomic_long_read(&c->nr_freed));
| ~~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| %ld
fs/bcachefs/util.h:223:54: note: expanded from macro 'prt_printf'
223 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
1 error generated.

On 64-bit architectures, size_t is 'unsigned long', so there is no
warning when using %zu but on 32-bit architectures, size_t is
'unsigned int'. Use '%lu' to match the other format specifiers used in
this function for printing values returned from atomic_long_read().

Fixes: 6d799930ce0f ("bcachefs: btree key cache pcpu freedlist")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5b7fbdcd 09-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix silent enum conversion error

This changes mark_btree_node_locked() to take an enum
btree_node_locked_type, not a six_lock_type, since BTREE_NODE_UNLOCKED
is -1 which may cause problems converting back and forth to
six_lock_type if short enums are in use.

With this change, we never store BTREE_NODE_UNLOCKED in a six_lock_type
enum.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5eaa76d8 13-Jul-2023 Mikulas Patocka <mpatocka@redhat.com>

bcachefs: mark bch_inode_info and bkey_cached as reclaimable

Mark these caches as reclaimable, so that available memory is correctly
reported when there is a lot of cached inodes.

Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 30a8278a 09-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add new assertions for shutdown path

We've been seeing assertions pop that indicate the btree node cache or
key cache have dirty items when we just did a clean shutdown.

Add some more assertions so we can catch this when we're dirtying items.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f33c58fc 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_INSERT_USE_RESERVE

Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ec14fc60 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill JOURNAL_WATERMARK

This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b3591acc 26-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: unregister_shrinker() now safe on not-registered shrinker

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d95dd378 28-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: allocate_dropping_locks()

Add two new helpers for allocating memory with btree locks held: The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1fb4fe63 20-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

six locks: Kill six_lock_state union

As suggested by Linus, this drops the six_lock_state union in favor of
raw bitmasks.

On the one hand, bitfields give more type-level structure to the code.
However, a significant amount of the code was working with
six_lock_state as a u64/atomic64_t, and the conversions from the
bitfields to the u64 were deemed a bit too out-there.

More significantly, because bitfield order is poorly defined (#ifdef
__LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the
sequence number would overflow into the rest of the bitfield if the
compiler didn't put the sequence number at the high end of the word.

The new code is a bit saner when we're on an architecture without real
atomic64_t support - all accesses to lock->state now go through
atomic64_*() operations.

On architectures with real atomic64_t support, we additionally use
atomic bit ops for setting/clearing individual bits.

Text size: 7467 bytes -> 4649 bytes - compilers still suck at
bitfields.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0d2234a7 20-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

six locks: Kill six_lock_pcpu_(alloc|free)

six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate
or free the percpu reader count on an existing lock that's in use, the
only safe time to allocate percpu readers is when the lock is first
being initialized.

This patch adds a flags parameter to six_lock_init(), and instead of
six_lock_pcpu_free() we now expose six_lock_exit(), which does the same
thing but is less likely to be misused.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# bcb79a51 29-Apr-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_bkey_get_iter() helpers

Introduce new helpers for a common pattern:

bch2_trans_iter_init();
bch2_btree_iter_peek_slot();

- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 65d48e35 14-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Private error codes: ENOMEM

This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e53d03fe 02-Mar-2023 Brian Foster <bfoster@redhat.com>

bcachefs: don't bump key cache journal seq on nojournal commits

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journal. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ac2ccddc 04-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Drop some anonymous structs, unions

Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenienc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3329cf1b 02-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Centralize btree node lock initialization

This fixes some confusion in the lockdep code due to initializing btree
node/key cache locks with the same lockdep key, but different names.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 30ca6ece 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill trans->flags

Recursive transaction commits are occasionally necessary - in
particular, for the upcoming btree write buffer's flush path.

This avoids bugs due to trans->flags being accidentally mutated
mid-commit, which can cause c->writes refcount leaks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 5b3008bc 02-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't call bch2_journal_pin_drop() under key cache lock

This fixes a (harmless) lockdep splat, due to a lock order violation in
the key cache exit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 94c69faf 04-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Use six_lock_ip()

This uses the new _ip() interface to six locks and hooks it up to
btree_path->ip_allocated, when available.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b8c5b16f 24-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't emit tracepoints for expected events

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6c36318c 07-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: key cache: Don't hold btree locks while using GFP_RECLAIM

This is something we need to do more widely: instead of bothering with
GFP_NOIO/GFP_NOFS, if we need to allocate memory while holding locks:

- first attempt the allocation with GFP_NOWAIT
- if that fails, drop btree locks with bch2_trans_unlock(), then
retry with GFP_KERNEL.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 7af365eb 07-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bkey_cached_lock_for_evict()

We don't need a write lock to check if a key is dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6f90e6b2 25-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a livelock in key cache fill path

We weren't setting path->uptodate before calling
bch2_btree_key_cache_fill() - which causes __bch2_btree_path_upgrade()
to fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1617d56d 22-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Key cache now works for snapshots btrees

This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.

We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.

This will allow us to re-enable the key cache for inodes in the next
patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 087e53c2 20-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Bring back BTREE_ITER_CACHED_NOFILL

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# e88a75eb 24-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: New bpos_cmp(), bkey_cmp() replacements

This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()

and equivalent replacements for bkey_cmp().

Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 061f7999 14-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a use after free

This fixes a regression from percpu freedlists in the btree key cache
code: in a rare error path, we were immediately freeing a bkey_cached
that had been used before and should've waited for an SRCU barrier.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3e3e02e6 19-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted checkpatch fixes

checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:

ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
(hopefully with docbook documentation), not .h
file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE
- we have _many_ x-macros and other macros where
we can't do this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b2f83e76 17-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache shrinker fix

The shrinker assumes freed key cache items are ordered by age, so that
it doesn't have to scan the full list to find items that are old enough
(according to the srcu code) to be freed.

But percpu freelists broke this ordering; this patch fixes this by
ensuring we insert items into the proper position.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# fe5b37f6 14-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache improvements

- In userspace, we don't have real percpu variables; this patch
disables the percpu freelists in userspace
- add some error messages for the asserts in
bch2_fs_btree_key_cache_exit(); we've been hitting this (only in
userspace, oddly), perhaps this will help us track down the error.
- bkey_cached_reuse() should likely be taking the key cache lock, and
it's a slowpath so it doesn't hurt to

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0196eb89 14-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_key_cache_scan() doesn't need trylock

We don't actually allocate memory under the btree key cache lock - so
there's no recursion concerns, and the shrinker can just use
mutex_lock().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 99e2146b 26-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Break out bch2_btree_path_traverse_cached_slowpath()

Prep work for further refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0d7009d7 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete old deadlock avoidance code

This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 1bb91233 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Ensure intent locks are marked before taking write locks

Locks must be correctly marked for the cycle detector to work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 38474c26 02-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Avoid using btree_node_lock_nopath()

With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.

All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3d21d48e 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix usage of six lock's percpu mode, key cache version

Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.

Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segragate percpu and non percpu objects, including when
on freelists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 0242130f 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Refactor bkey_cached_alloc() path

Clean up the arguments passed and make them more consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# da4474f2 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert more locking code to btree_bkey_cached_common

Ideally, all the code in btree_locking.c should be converted, but then
we'd want to convert btree_path to point to btree_key_cached_common too,
and then we'd be in for a much bigger cleanup - but a bit of incremental
cleanup will still be helpful for the next patches.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4e6defd1 31-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_bkey_cached_common->cached

Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d5024b01 22-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_node_lock_write_nofail()

Taking a write lock will be able to fail, with the new cycle detector -
unless we pass it nofail, which is possible but not preferred.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# ca7d8fca 21-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: New locking functions

In the future, with the new deadlock cycle detector, we won't be using
bare six_lock_* anymore: lock wait entries will all be embedded in
btree_trans, and we will need a btree_trans context whenever locking a
btree node.

This patch plumbs a btree_trans to the few places that need it, and adds
two new locking functions
- btree_node_lock_nopath, which may fail returning a transaction
restart, and
- btree_node_lock_nopath_nofail, to be used in places where we know we
cannot deadlock (i.e. because we're holding no other locks).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c919f53f 30-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't leak lock pcpu counts memory

This fixes a small memory leak.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 674cfc26 26-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add persistent counters for all tracepoints

Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 06a53943 25-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Correctly initialize bkey_cached->lock

We need to use the right class for some assertions to work correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 45b033fa 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix assertion in bch2_btree_key_cache_drop()

Turns out this assertion was something we could legitimately hit - add a
comment describing what's going on, and handle it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 6fae65c1 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_CACHED_(NOFILL|NOCREATE)

These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 9f96568c 09-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 315c9ba6 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NO_NODE -> BCH_ERR codes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 49e401fa 07-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# ae33e7a2 03-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add distinct error code for key_cache_upgrade

This aids in debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 549d173c 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: EINTR -> BCH_ERR_transaction_restart

Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c807ca95 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: added lock held time stats

We now record the length of time btree locks are held and expose this in debugfs.

Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8bfe14e8 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: lock time stats prep work.

We need the caller name and a place to store our results, btree_trans provides this.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8f7f566f 16-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache pcpu freedlist

Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.

But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.

This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 401ec4db 03-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Printbuf rework

This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# a729e489 17-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Allocate some extra room in btree_key_cache_fill()

If we allocate a buffer that's a bit bigger than necessary the
transaction commit path will be much less likely to have to reallocate -
which requires a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 502f973d 09-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a few warnings on 32 bit

These showed up when building for mips.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 31f63fd1 14-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Introduce a separate journal watermark for copygc

Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.

This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 30985537 04-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix usage of six lock's percpu mode

Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segragating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8322a937 04-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache optimization

This helps with lock contention in the journalling code: instead of
updating our journal pin on every write, only get a journal pin if we
don't have one.

This means we can avoid hammering on journal locks nearly so much, at
the cost of carrying around a journal pin for an older entry than the
one we actually need. To handle that, if needed we update our journal
pin to the correct one when flushed by journal reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8be1aff0 15-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete redundant tracepoint

We were emitting two trace events on transaction restart in this code
path - delete the redundant one.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 12ce5b7d 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache coherency

- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.

- Except when creating new keys: then the update goes to underlying
btree

For for iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.

Otherwise, for consistency, all updates should go to the same place -
the key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f7b6ca23 06-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_KEY_CACHE

This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).

Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# a9c0b125 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree_key_cache_flush_pos()

btree_key_cache_flush_pos() uses BTREE_ITER_CACHED_NOFILL - but it
wasn't checking for !ck->valid. It does check for the entry being dirty,
so it shouldn't matter, but this refactor it a bit and adds and
assertion.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# bc82d08b 08-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

This improves the transaction restart tracepoints - adding distinct
tracepoints for all the locations and reasons a transaction might have
been restarted, and ensures that there's a tracepoint for every
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 03ea3962 04-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Log & error message improvements

- Add a shim uuid_unparse_lower() in the kernel, since %pU doesn't work
in userspace

- We don't need to print the bcachefs: or the filesystem name prefix in
userspace

- Improve a few error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 669f87a5 03-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Switch to __func__for recording where btree_trans was initialized

Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# f0f41a6d 30-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add error messages for memory allocation failures

This adds some missing diagnostics from rare but annoying to debug
runtime allocation failure paths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 99fafb04 20-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix some shutdown path bugs

This fixes some bugs when we hit an error very early in the filesystem
startup path, before most things have been initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# c075ff70 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_FILTER_SNAPSHOTS

For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 14b393ee 15-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Subvolumes, snapshots

This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 3074bc0f 15-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

Revert "bcachefs: Add more assertions for locking btree iterators out of order"

Figured out the bug we were chasing, and it had nothing to do with
locking btree iterators/paths out of order.

This reverts commit ff08733dd298c969aec7c7828095458f73fd5374.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 068bcaa5 03-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add more assertions for locking btree iterators out of order

btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 67e0dd8f 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_path

This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# deb0e573 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NEED_PEEK

This was used for an optimization that hasn't existing in quite awhile
- iter->uptodate will probably be going away as well.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 78cf784e 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Further reduce iter->trans usage

This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 9f6bd307 24-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce iter->trans usage

Disfavoured, and should go away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 1a488e73 27-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_INSERT_NOUNLOCK

With the recent transaction restart changes, it's no longer needed - all
transaction commits have BTREE_INSERT_NOUNLOCK semantics.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# e5af273f 25-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans->restarted

Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# a6eba44b 23-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use bch2_trans_do() in bch2_btree_key_cache_journal_flush()

We're working to standardize handling of transaction restarts.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5f87f3c1 20-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't downgrade in traverse()

Downgrading of btree iterators is something that should only happen
explicitly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# 5aab6635 14-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tighten up btree_iter locking assertions

We weren't correctly verifying that we had interior node intent locks -
this patch also fixes bugs uncovered by the new assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# b00fde8f 05-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

Add a new flag to control assertions about updating to internal snapshot
nodes, that normally should not be written to - to be used in an
upcoming patch.

Also do some renaming - trigger_flags is now update_flags.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

# baa65029 27-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Change bch2_btree_key_cache_count() to exclude dirty keys

We're seeing livelocks that appear to be due to
bch2_btree_key_cache_scan repeatedly scanning and blocking other tasks
from using the key cache lock - we probably shouldn't be reporting
objects that can't actually be freed yet.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4932e07e 24-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix key cache assertion

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# bc2e5d5c 23-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix an out of bounds read

bch2_varint_decode() can read up to 7 bytes past the end of the buffer,
which means we need to allocate slightly larger key cache buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f09517fc 20-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock on journal reclaim

Flushing the btree key cache needs to use allocation reserves - journal
reclaim depends on flushing the btree key cache for making forward
progress, and the allocator and copygc depend on journal reclaim making
forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 241e2636 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't flush btree writes more aggressively because of btree key cache

We need to flush the btree key cache when it's too dirty, because
otherwise the shrinker won't be able to reclaim memory - this is done by
journal reclaim. But journal reclaim also kicks btree node writes: this
meant that btree node writes were getting kicked much too often just
because we needed to flush btree key cache keys.

This patch splits journal pins into two different lists, and teaches
journal reclaim to not flush btree node writes when it only needs to
flush key cache keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f72b1fd7 04-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a startup race

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2940295c 03-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Be more careful about JOURNAL_RES_GET_RESERVED

JOURNAL_RES_GET_RESERVED should only be used for updatse that need to be
done to free up space in the journal. In particular, when we're flushing
keys from the key cache, if we're flushing them out of order we
shouldn't be using it, since we're using up our remaining space in the
journal without dropping a pin that will let us make forward progress.

With this patch, BTREE_INSERT_JOURNAL_RECLAIM without
BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on
journal reclaim if we're already in journal reclaim.

This means we need to propagate these errors up to journal reclaim,
indicating that flushing a journal pin should be retried in the future.

This is prep work for a patch to change the way journal reclaim works,
to split out flushing key cache keys because the btree key cache is too
dirty from journal reclaim because we need space in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 4cf91b02 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out bpos_cmp() and bkey_cmp()

With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.

Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 331194a2 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache locking improvements

The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8d956c2f 19-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter_set_dontneed()

This is a bit clearer than using bch2_btree_iter_free().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 53b3e3c0 08-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix locking in bch2_btree_iter_traverse_cached()

bch2_btree_iter_traverse() is supposed to ensure we have the correct
type of lock - it was downgrading if necessary, but if we entered with a
read lock it wasn't upgrading to an intent lock, oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3187aa8d 21-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much

Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.

- we now have more open_buckets than we used to, and the reserves work
better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
we're holding open_buckets pinned anymore.

- We have the btree key cache for updates to the alloc btree, meaning
we no longer need the btree reserve to ensure the allocator can make
forward progress.

This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.

Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 1d8305c1 13-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add some cond_rescheds() in shutdown path

Particularly on emergency shutdown we can end up having to clean up a
lot of dirty cached btree keys here.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b206df6e 03-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix some spurious gcc warnings

These only come up when building in userspace, for some reason.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 3eb26d01 01-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_get_iter() no longer returns errors

Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.

This patch also simplifies the iterator allocation code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d7b04163 30-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Change a BUG_ON() to a fatal error

In the btree key cache code, failing to flush a dirty key is a serious
error, but it doesn't need to be a BUG_ON(), we can stop the filesystem
instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d0022290 29-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix error in filesystem initialization

The rhashtable code doesn't like when we destroy an rhashtable that was
never initialized

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# b7a9bbfc 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Move journal reclaim to a kthread

This is to make tracing easier.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8a92e545 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure journal reclaim runs when btree key cache is too dirty

Ensuring the key cache isn't too dirty is critical for ensuring that the
shrinker can reclaim memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 12590720 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree key cache shrinker

The shrinker should start scanning for entries that can be freed oldest
to newest - this way, we can avoid scanning a lot of entries that are
too new to be freed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 14ba3706 18-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a kmem_cache for btree_key_cache objects

We allocate a lot of these, and we're seeing sporading OOMs - this will
help with tracking those down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 628a3ad2 12-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a shrinker for the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# f526d26d 11-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix btree key cache shutdown

On emergency shutdown, we might still have dirty keys in the btree key
cache that need to be cleaned up properly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6a747c46 09-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add accounting for dirty btree nodes/keys

This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4s of the btree node cache and btree key cache are dirty
- ensuring the shrinkers can free memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 73e7470b 05-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: More inlinining in the btree key cache code

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# a301dc38 28-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve tracing for transaction restarts

We have a bug where we can get stuck with a process spinning in
transaction restarts - need more information.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 8cad3e2f 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use cached iterators for inode updates

This switches inode updates to use cached btree iterators - which should
be a nice performance boost, since lock contention on the inodes btree
can be a bottleneck on multithreaded workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# d211b408 15-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix lock ordering with new btree cache code

The code that checks lock ordering was recently changed to go off of the
pos of the btree node, rather than the iterator, but the btree cache
code didn't update to handle iterators that point to cached bkeys. Oops

Also, update various debug code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 2ca88e5a 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# 6bd68ec2 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Heap allocate btree_trans

We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 96dea3d5 12-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix W=12 build errors

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f7ed15eb 12-Sep-2023 Nathan Chancellor <nathan@kernel.org>

bcachefs: Fix -Wformat in bch2_btree_key_cache_to_text()

When building bcachefs for 32-bit ARM, there is a compiler warning in
bch2_btree_key_cache_to_text() due to use of an incorrect format
specifier:

fs/bcachefs/btree_key_cache.c:1060:36: error: format specifies type 'size_t' (aka 'unsigned int') but the argument has type 'long' [-Werror,-Wformat]
1060 | prt_printf(out, "nr_freed:\t%zu", atomic_long_read(&c->nr_freed));
| ~~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| %ld
fs/bcachefs/util.h:223:54: note: expanded from macro 'prt_printf'
223 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
1 error generated.

On 64-bit architectures, size_t is 'unsigned long', so there is no
warning when using %zu but on 32-bit architectures, size_t is
'unsigned int'. Use '%lu' to match the other format specifiers used in
this function for printing values returned from atomic_long_read().

Fixes: 6d799930ce0f ("bcachefs: btree key cache pcpu freedlist")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5b7fbdcd 09-Sep-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix silent enum conversion error

This changes mark_btree_node_locked() to take an enum
btree_node_locked_type, not a six_lock_type, since BTREE_NODE_UNLOCKED
is -1 which may cause problems converting back and forth to
six_lock_type if short enums are in use.

With this change, we never store BTREE_NODE_UNLOCKED in a six_lock_type
enum.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5eaa76d8 13-Jul-2023 Mikulas Patocka <mpatocka@redhat.com>

bcachefs: mark bch_inode_info and bkey_cached as reclaimable

Mark these caches as reclaimable, so that available memory is correctly
reported when there is a lot of cached inodes.

Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 30a8278a 09-Jul-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add new assertions for shutdown path

We've been seeing assertions pop that indicate the btree node cache or
key cache have dirty items when we just did a clean shutdown.

Add some more assertions so we can catch this when we're dirtying items.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f33c58fc 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill BTREE_INSERT_USE_RESERVE

Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ec14fc60 27-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill JOURNAL_WATERMARK

This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b3591acc 26-Jun-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: unregister_shrinker() now safe on not-registered shrinker

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d95dd378 28-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: allocate_dropping_locks()

Add two new helpers for allocating memory with btree locks held: The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1fb4fe63 20-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

six locks: Kill six_lock_state union

As suggested by Linus, this drops the six_lock_state union in favor of
raw bitmasks.

On the one hand, bitfields give more type-level structure to the code.
However, a significant amount of the code was working with
six_lock_state as a u64/atomic64_t, and the conversions from the
bitfields to the u64 were deemed a bit too out-there.

More significantly, because bitfield order is poorly defined (#ifdef
__LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the
sequence number would overflow into the rest of the bitfield if the
compiler didn't put the sequence number at the high end of the word.

The new code is a bit saner when we're on an architecture without real
atomic64_t support - all accesses to lock->state now go through
atomic64_*() operations.

On architectures with real atomic64_t support, we additionally use
atomic bit ops for setting/clearing individual bits.

Text size: 7467 bytes -> 4649 bytes - compilers still suck at
bitfields.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0d2234a7 20-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

six locks: Kill six_lock_pcpu_(alloc|free)

six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate
or free the percpu reader count on an existing lock that's in use, the
only safe time to allocate percpu readers is when the lock is first
being initialized.

This patch adds a flags parameter to six_lock_init(), and instead of
six_lock_pcpu_free() we now expose six_lock_exit(), which does the same
thing but is less likely to be misused.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# bcb79a51 29-Apr-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_bkey_get_iter() helpers

Introduce new helpers for a common pattern:

bch2_trans_iter_init();
bch2_btree_iter_peek_slot();

- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 65d48e35 14-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Private error codes: ENOMEM

This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e53d03fe 02-Mar-2023 Brian Foster <bfoster@redhat.com>

bcachefs: don't bump key cache journal seq on nojournal commits

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journal. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ac2ccddc 04-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Drop some anonymous structs, unions

Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenienc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3329cf1b 02-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Centralize btree node lock initialization

This fixes some confusion in the lockdep code due to initializing btree
node/key cache locks with the same lockdep key, but different names.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 30ca6ece 09-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill trans->flags

Recursive transaction commits are occasionally necessary - in
particular, for the upcoming btree write buffer's flush path.

This avoids bugs due to trans->flags being accidentally mutated
mid-commit, which can cause c->writes refcount leaks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 5b3008bc 02-Mar-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't call bch2_journal_pin_drop() under key cache lock

This fixes a (harmless) lockdep splat, due to a lock order violation in
the key cache exit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 94c69faf 04-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Use six_lock_ip()

This uses the new _ip() interface to six locks and hooks it up to
btree_path->ip_allocated, when available.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b8c5b16f 24-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't emit tracepoints for expected events

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6c36318c 07-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: key cache: Don't hold btree locks while using GFP_RECLAIM

This is something we need to do more widely: instead of bothering with
GFP_NOIO/GFP_NOFS, if we need to allocate memory while holding locks:

- first attempt the allocation with GFP_NOWAIT
- if that fails, drop btree locks with bch2_trans_unlock(), then
retry with GFP_KERNEL.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 7af365eb 07-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bkey_cached_lock_for_evict()

We don't need a write lock to check if a key is dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6f90e6b2 25-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a livelock in key cache fill path

We weren't setting path->uptodate before calling
bch2_btree_key_cache_fill() - which causes __bch2_btree_path_upgrade()
to fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1617d56d 22-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Key cache now works for snapshots btrees

This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.

We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.

This will allow us to re-enable the key cache for inodes in the next
patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 087e53c2 20-Dec-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Bring back BTREE_ITER_CACHED_NOFILL

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# e88a75eb 24-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: New bpos_cmp(), bkey_cmp() replacements

This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()

and equivalent replacements for bkey_cmp().

Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 061f7999 14-Nov-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a use after free

This fixes a regression from percpu freedlists in the btree key cache
code: in a rare error path, we were immediately freeing a bkey_cached
that had been used before and should've waited for an SRCU barrier.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3e3e02e6 19-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Assorted checkpatch fixes

checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:

ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
(hopefully with docbook documentation), not .h
file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE
- we have _many_ x-macros and other macros where
we can't do this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b2f83e76 17-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache shrinker fix

The shrinker assumes freed key cache items are ordered by age, so that
it doesn't have to scan the full list to find items that are old enough
(according to the srcu code) to be freed.

But percpu freelists broke this ordering; this patch fixes this by
ensuring we insert items into the proper position.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# fe5b37f6 14-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache improvements

- In userspace, we don't have real percpu variables; this patch
disables the percpu freelists in userspace
- add some error messages for the asserts in
bch2_fs_btree_key_cache_exit(); we've been hitting this (only in
userspace, oddly), perhaps this will help us track down the error.
- bkey_cached_reuse() should likely be taking the key cache lock, and
it's a slowpath so it doesn't hurt to

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0196eb89 14-Oct-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_key_cache_scan() doesn't need trylock

We don't actually allocate memory under the btree key cache lock - so
there's no recursion concerns, and the shrinker can just use
mutex_lock().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 99e2146b 26-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Break out bch2_btree_path_traverse_cached_slowpath()

Prep work for further refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0d7009d7 22-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete old deadlock avoidance code

This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 1bb91233 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Ensure intent locks are marked before taking write locks

Locks must be correctly marked for the cycle detector to work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 38474c26 02-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Avoid using btree_node_lock_nopath()

With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.

All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3d21d48e 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix usage of six lock's percpu mode, key cache version

Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.

Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segragate percpu and non percpu objects, including when
on freelists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 0242130f 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Refactor bkey_cached_alloc() path

Clean up the arguments passed and make them more consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# da4474f2 03-Sep-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Convert more locking code to btree_bkey_cached_common

Ideally, all the code in btree_locking.c should be converted, but then
we'd want to convert btree_path to point to btree_key_cached_common too,
and then we'd be in for a much bigger cleanup - but a bit of incremental
cleanup will still be helpful for the next patches.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4e6defd1 31-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_bkey_cached_common->cached

Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d5024b01 22-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_node_lock_write_nofail()

Taking a write lock will be able to fail, with the new cycle detector -
unless we pass it nofail, which is possible but not preferred.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# ca7d8fca 21-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: New locking functions

In the future, with the new deadlock cycle detector, we won't be using
bare six_lock_* anymore: lock wait entries will all be embedded in
btree_trans, and we will need a btree_trans context whenever locking a
btree node.

This patch plumbs a btree_trans to the few places that need it, and adds
two new locking functions
- btree_node_lock_nopath, which may fail returning a transaction
restart, and
- btree_node_lock_nopath_nofail, to be used in places where we know we
cannot deadlock (i.e. because we're holding no other locks).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c919f53f 30-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't leak lock pcpu counts memory

This fixes a small memory leak.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 674cfc26 26-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add persistent counters for all tracepoints

Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 06a53943 25-Aug-2022 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Correctly initialize bkey_cached->lock

We need to use the right class for some assertions to work correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 45b033fa 11-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix assertion in bch2_btree_key_cache_drop()

Turns out this assertion was something we could legitimately hit - add a
comment describing what's going on, and handle it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 6fae65c1 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_CACHED_(NOFILL|NOCREATE)

These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 9f96568c 09-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 315c9ba6 10-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_NO_NODE -> BCH_ERR codes

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 49e401fa 07-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# ae33e7a2 03-Aug-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add distinct error code for key_cache_upgrade

This aids in debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 549d173c 17-Jul-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: EINTR -> BCH_ERR_transaction_restart

Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c807ca95 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: added lock held time stats

We now record the length of time btree locks are held and expose this in debugfs.

Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8bfe14e8 14-Jul-2022 Daniel Hill <daniel@gluo.nz>

bcachefs: lock time stats prep work.

We need the caller name and a place to store our results, btree_trans provides this.

Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8f7f566f 16-Jun-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache pcpu freedlist

Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.

But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.

This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 401ec4db 03-Feb-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Printbuf rework

This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a729e489 17-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Allocate some extra room in btree_key_cache_fill()

If we allocate a buffer that's a bit bigger than necessary the
transaction commit path will be much less likely to have to reallocate -
which requires a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 502f973d 09-Apr-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a few warnings on 32 bit

These showed up when building for mips.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 31f63fd1 14-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Introduce a separate journal watermark for copygc

Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.

This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 30985537 04-Mar-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix usage of six lock's percpu mode

Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segragating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8322a937 04-Jan-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Btree key cache optimization

This helps with lock contention in the journalling code: instead of
updating our journal pin on every write, only get a journal pin if we
don't have one.

This means we can avoid hammering on journal locks nearly so much, at
the cost of carrying around a journal pin for an older entry than the
one we actually need. To handle that, if needed we update our journal
pin to the correct one when flushed by journal reclaim.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8be1aff0 15-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete redundant tracepoint

We were emitting two trace events on transaction restart in this code
path - delete the redundant one.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 12ce5b7d 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache coherency

- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.

- Except when creating new keys: then the update goes to underlying
btree

For for iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.

Otherwise, for consistency, all updates should go to the same place -
the key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f7b6ca23 06-Feb-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_KEY_CACHE

This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).

Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a9c0b125 11-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree_key_cache_flush_pos()

btree_key_cache_flush_pos() uses BTREE_ITER_CACHED_NOFILL - but it
wasn't checking for !ck->valid. It does check for the entry being dirty,
so it shouldn't matter, but this refactor it a bit and adds and
assertion.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# bc82d08b 08-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tracepoint improvements

This improves the transaction restart tracepoints - adding distinct
tracepoints for all the locations and reasons a transaction might have
been restarted, and ensures that there's a tracepoint for every
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 03ea3962 04-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Log & error message improvements

- Add a shim uuid_unparse_lower() in the kernel, since %pU doesn't work
in userspace

- We don't need to print the bcachefs: or the filesystem name prefix in
userspace

- Improve a few error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 669f87a5 03-Jan-2022 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Switch to __func__for recording where btree_trans was initialized

Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# f0f41a6d 30-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add error messages for memory allocation failures

This adds some missing diagnostics from rare but annoying to debug
runtime allocation failure paths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 99fafb04 20-Dec-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix some shutdown path bugs

This fixes some bugs when we hit an error very early in the filesystem
startup path, before most things have been initialized.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# c075ff70 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_FILTER_SNAPSHOTS

For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 14b393ee 15-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Subvolumes, snapshots

This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 3074bc0f 15-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

Revert "bcachefs: Add more assertions for locking btree iterators out of order"

Figured out the bug we were chasing, and it had nothing to do with
locking btree iterators/paths out of order.

This reverts commit ff08733dd298c969aec7c7828095458f73fd5374.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 068bcaa5 03-Sep-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add more assertions for locking btree iterators out of order

btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 67e0dd8f 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_path

This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# deb0e573 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_ITER_NEED_PEEK

This was used for an optimization that hasn't existing in quite awhile
- iter->uptodate will probably be going away as well.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 78cf784e 30-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Further reduce iter->trans usage

This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 9f6bd307 24-Aug-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Reduce iter->trans usage

Disfavoured, and should go away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 1a488e73 27-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill BTREE_INSERT_NOUNLOCK

With the recent transaction restart changes, it's no longer needed - all
transaction commits have BTREE_INSERT_NOUNLOCK semantics.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# e5af273f 25-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: trans->restarted

Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# a6eba44b 23-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use bch2_trans_do() in bch2_btree_key_cache_journal_flush()

We're working to standardize handling of transaction restarts.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5f87f3c1 20-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't downgrade in traverse()

Downgrading of btree iterators is something that should only happen
explicitly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 5aab6635 14-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Tighten up btree_iter locking assertions

We weren't correctly verifying that we had interior node intent locks -
this patch also fixes bugs uncovered by the new assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# b00fde8f 05-Jul-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

Add a new flag to control assertions about updating to internal snapshot
nodes, that normally should not be written to - to be used in an
upcoming patch.

Also do some renaming - trigger_flags is now update_flags.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# baa65029 27-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Change bch2_btree_key_cache_count() to exclude dirty keys

We're seeing livelocks that appear to be due to
bch2_btree_key_cache_scan repeatedly scanning and blocking other tasks
from using the key cache lock - we probably shouldn't be reporting
objects that can't actually be freed yet.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4932e07e 24-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix key cache assertion

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# bc2e5d5c 23-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix an out of bounds read

bch2_varint_decode() can read up to 7 bytes past the end of the buffer,
which means we need to allocate slightly larger key cache buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f09517fc 20-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a deadlock on journal reclaim

Flushing the btree key cache needs to use allocation reserves - journal
reclaim depends on flushing the btree key cache for making forward
progress, and the allocator and copygc depend on journal reclaim making
forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 241e2636 31-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't flush btree writes more aggressively because of btree key cache

We need to flush the btree key cache when it's too dirty, because
otherwise the shrinker won't be able to reclaim memory - this is done by
journal reclaim. But journal reclaim also kicks btree node writes: this
meant that btree node writes were getting kicked much too often just
because we needed to flush btree key cache keys.

This patch splits journal pins into two different lists, and teaches
journal reclaim to not flush btree node writes when it only needs to
flush key cache keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f72b1fd7 04-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix a startup race

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2940295c 03-Apr-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Be more careful about JOURNAL_RES_GET_RESERVED

JOURNAL_RES_GET_RESERVED should only be used for updatse that need to be
done to free up space in the journal. In particular, when we're flushing
keys from the key cache, if we're flushing them out of order we
shouldn't be using it, since we're using up our remaining space in the
journal without dropping a pin that will let us make forward progress.

With this patch, BTREE_INSERT_JOURNAL_RECLAIM without
BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on
journal reclaim if we're already in journal reclaim.

This means we need to propagate these errors up to journal reclaim,
indicating that flushing a journal pin should be retried in the future.

This is prep work for a patch to change the way journal reclaim works,
to split out flushing key cache keys because the btree key cache is too
dirty from journal reclaim because we need space in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 4cf91b02 04-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out bpos_cmp() and bkey_cmp()

With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.

Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 331194a2 24-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree key cache locking improvements

The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8d956c2f 19-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_iter_set_dontneed()

This is a bit clearer than using bch2_btree_iter_free().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 53b3e3c0 08-Mar-2021 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix locking in bch2_btree_iter_traverse_cached()

bch2_btree_iter_traverse() is supposed to ensure we have the correct
type of lock - it was downgrading if necessary, but if we entered with a
read lock it wasn't upgrading to an intent lock, oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3187aa8d 21-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much

Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.

- we now have more open_buckets than we used to, and the reserves work
better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
we're holding open_buckets pinned anymore.

- We have the btree key cache for updates to the alloc btree, meaning
we no longer need the btree reserve to ensure the allocator can make
forward progress.

This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.

Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 1d8305c1 13-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add some cond_rescheds() in shutdown path

Particularly on emergency shutdown we can end up having to clean up a
lot of dirty cached btree keys here.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b206df6e 03-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix some spurious gcc warnings

These only come up when building in userspace, for some reason.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 3eb26d01 01-Dec-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_trans_get_iter() no longer returns errors

Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.

This patch also simplifies the iterator allocation code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d7b04163 30-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Change a BUG_ON() to a fatal error

In the btree key cache code, failing to flush a dirty key is a serious
error, but it doesn't need to be a BUG_ON(), we can stop the filesystem
instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d0022290 29-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix error in filesystem initialization

The rhashtable code doesn't like when we destroy an rhashtable that was
never initialized

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# b7a9bbfc 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Move journal reclaim to a kthread

This is to make tracing easier.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8a92e545 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure journal reclaim runs when btree key cache is too dirty

Ensuring the key cache isn't too dirty is critical for ensuring that the
shrinker can reclaim memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 12590720 19-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve btree key cache shrinker

The shrinker should start scanning for entries that can be freed oldest
to newest - this way, we can avoid scanning a lot of entries that are
too new to be freed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 14ba3706 18-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a kmem_cache for btree_key_cache objects

We allocate a lot of these, and we're seeing sporading OOMs - this will
help with tracking those down.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 628a3ad2 12-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a shrinker for the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# f526d26d 11-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix btree key cache shutdown

On emergency shutdown, we might still have dirty keys in the btree key
cache that need to be cleaned up properly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 6a747c46 09-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add accounting for dirty btree nodes/keys

This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4s of the btree node cache and btree key cache are dirty
- ensuring the shrinkers can free memory.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 73e7470b 05-Nov-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: More inlinining in the btree key cache code

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# a301dc38 28-Oct-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Improve tracing for transaction restarts

We have a bug where we can get stuck with a process spinning in
transaction restarts - need more information.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 8cad3e2f 22-Sep-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Use cached iterators for inode updates

This switches inode updates to use cached btree iterators - which should
be a nice performance boost, since lock contention on the inodes btree
can be a bottleneck on multithreaded workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# d211b408 15-Jun-2020 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix lock ordering with new btree cache code

The code that checks lock ordering was recently changed to go off of the
pos of the btree node, rather than the iterator, but the btree cache
code didn't update to handle iterators that point to cached bkeys. Oops

Also, update various debug code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


# 2ca88e5a 07-Mar-2019 Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Btree key cache

This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.

Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.

These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>