History log of /linux-master/drivers/md/bcache/btree.h
Revision Date Author Comments
# f0854489 15-Jun-2023 Mingzhe Zou <mingzhe.zou@easystack.cn>

bcache: fixup btree_cache_wait list damage

We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

crash> struct list_head 0xffff9c801bc01210
struct list_head {
next = 0xffffae586d8afe68,
prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
next = 0x0,
prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68 type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68

[230469.019492] Call Trace:
[230469.032041] prepare_to_wait+0x8a/0xb0
[230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533] mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788] mca_alloc+0x2ae/0x450 [escache]
[230469.080790] bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382] ? finish_wait+0x80/0x80
[230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259] kthread+0x112/0x130
[230469.138448] ? kthread_flush_work_fn+0x10/0x10
[230469.149477] ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.

We must call finish_wait() to remove op from btree_cache_wait before free
it's memory address. Otherwise, the list will be damaged. Also should call
bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up
other waiters.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4c8a4924 09-May-2023 Kent Overstreet <kent.overstreet@linux.dev>

bcache: Convert to lock_cmp_fn

Replace one of bcache's lockdep_set_novalidate_class() usage with the
newly introduced custom lock nesting annotation.

[peterz: changelog]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Coly Li <colyli@suse.de>
Link: https://lkml.kernel.org/r/20230509195847.1745548-2-kent.overstreet@linux.dev


# 62253644 24-May-2022 Coly Li <colyli@suse.de>

bcache: improve multithreaded bch_btree_check()

Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() to be much faster when checking
all btree nodes during cache device registration. But it isn't in ideal
shap yet, still can be improved.

This patch does the following thing to improve current parallel btree
nodes check by multiple threads in bch_btree_check(),
- Add read lock to root node while checking all the btree nodes with
multiple threads. Although currently it is not mandatory but it is
good to have a read lock in code logic.
- Remove local variable 'char name[32]', and generate kernel thread name
string directly when calling kthread_run().
- Allocate local variable "struct btree_check_state check_state" on the
stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed.
- Increase check_state->started to count created kernel thread after it
succeeds to create.
- When wait for all checking kernel threads to finish, use wait_event()
to replace wait_event_interruptible().

With this change, the code is more clear, and some potential error
conditions are avoided.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4a784266 01-Oct-2020 Coly Li <colyli@suse.de>

bcache: remove embedded struct cache_sb from struct cache_set

Since bcache code was merged into mainline kerrnel, each cache set only
as one single cache in it. The multiple caches framework is here but the
code is far from completed. Considering the multiple copies of cached
data can also be stored on e.g. md raid1 devices, it is unnecessary to
support multiple caches in one cache set indeed.

The previous preparation patches fix the dependencies of explicitly
making a cache set only have single cache. Now we don't have to maintain
an embedded partial super block in struct cache_set, the in-memory super
block can be directly referenced from struct cache.

This patch removes the embedded struct cache_sb from struct cache_set,
and fixes all locations where the superb lock was referenced from this
removed super block by referencing the in-memory super block of struct
cache.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 5ae3a2c0 24-Mar-2020 Coly Li <colyli@suse.de>

bcache: remove dupplicated declaration from btree.h

Commit 253a99d95d5b ("bcache: move macro btree() and btree_root()
into btree.h") makes two duplicated declaration into btree.h,
typedef int (btree_map_keys_fn)();
int bch_btree_map_keys();

The kbuild test robot <lkp@intel.com> detects and reports this
problem and this patch fixes it by removing the duplicated ones.

Fixes: 253a99d95d5b ("bcache: move macro btree() and btree_root() into btree.h")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8e710227 22-Mar-2020 Coly Li <colyli@suse.de>

bcache: make bch_btree_check() to be multithreaded

When registering a cache device, bch_btree_check() is called to check
all btree nodes, to make sure the btree is consistent and not
corrupted.

bch_btree_check() is recursively executed in a single thread, when there
are a lot of data cached and the btree is huge, it may take very long
time to check all the btree nodes. In my testing, I observed it took
around 50 minutes to finish bch_btree_check().

When checking the bcache btree nodes, the cache set is not running yet,
and indeed the whole tree is in read-only state, it is safe to create
multiple threads to check the btree in parallel.

This patch tries to create multiple threads, and each thread tries to
one-by-one check the sub-tree indexed by a key from the btree root node.
The parallel thread number depends on how many keys in the btree root
node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but in
practice is should be min(cpu-number/2, root-node-keys-number).

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# feac1a70 22-Mar-2020 Coly Li <colyli@suse.de>

bcache: add bcache_ prefix to btree_root() and btree() macros

This patch changes macro btree_root() and btree() to bcache_btree_root()
and bcache_btree(), to avoid potential generic name clash in future.

NOTE: for product kernel maintainers, this patch can be skipped if
you feel the rename stuffs introduce inconvenince to patch backport.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 253a99d9 22-Mar-2020 Coly Li <colyli@suse.de>

bcache: move macro btree() and btree_root() into btree.h

In order to accelerate bcache registration speed, the macro btree()
and btree_root() will be referenced out of btree.c. This patch moves
them from btree.c into btree.h with other relative function declaration
in btree.h, for the following changes.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 125d98ed 23-Jan-2020 Coly Li <colyli@suse.de>

bcache: remove member accessed from struct btree

The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clean it and look at next btree node cache from c->btree_cache
list, and only shrink the caches whose b->accessed is cleaned. Then
only cold btree node cache will be shrunk.

But when I/O pressure is high, it is very probably that b->accessed
of a btree node cache will be set again in bch_btree_node_get()
before bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to slub or slab system.

This patch removes member accessed from struct btree, then once a
btree node ache is selected, it will be immediately shunk. By this
change, bch_mca_scan() may release btree node cahce more efficiently.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 50a260e8 28-Jun-2019 Coly Li <colyli@suse.de>

bcache: fix race in btree_flush_write()

There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cb07ad63 13-Dec-2018 Coly Li <colyli@suse.de>

bcache: introduce force_wake_up_gc()

Garbage collection thread starts to work when c->sectors_to_gc is
negative value, otherwise nothing will happen even the gc thread is
woken up by wake_up_gc().

force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
wake_up_gc(), then gc thread may have chance to run if no one else sets
c->sectors_to_gc to a positive value before gc_should_run().

This routine can be called where the gc thread is woken up and required
to run in force.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b0d30981 10-Aug-2018 Coly Li <colyli@suse.de>

bcache: style fixes for lines over 80 characters

This patch fixes the lines over 80 characters into more lines, to minimize
warnings by checkpatch.pl. There are still some lines exceed 80 characters,
but it is better to be a single line and I don't change them.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fc2d5988 10-Aug-2018 Coly Li <colyli@suse.de>

bcache: add identifier names to arguments of function definitions

There are many function definitions do not have identifier argument names,
scripts/checkpatch.pl complains warnings like this,

WARNING: function definition argument 'struct bcache_device *' should
also have an identifier name
#16735: FILE: writeback.h:120:
+void bch_sectors_dirty_init(struct bcache_device *);

This patch adds identifier argument names to all bcache function
definitions to fix such warnings.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6f10f7d1 10-Aug-2018 Coly Li <colyli@suse.de>

bcache: style fix to replace 'unsigned' by 'unsigned int'

This patch fixes warning reported by checkpatch.pl by replacing 'unsigned'
with 'unsigned int'.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# cbb751c0 09-Aug-2018 Shenghui Wang <shhuiw@foxmail.com>

bcache: trivial - remove tailing backslash in macro BTREE_FLAG

Remove the tailing backslash in macro BTREE_FLAG in btree.h

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d44c2f9e 30-Oct-2017 Tang Junhui <tang.junhui@zte.com.cn>

bcache: update bucket_in_use in real time

bucket_in_use is updated in gc thread which triggered by invalidating or
writing sectors_to_gc dirty data, It's a long interval. Therefore, when we
use it to compare with the threshold, it is often not timely, which leads
to inaccurate judgment and often results in bucket depletion.

We have send a patch before, by the means of updating bucket_in_use
periodically In gc thread, which Coly thought that would lead high
latency, In this patch, we add avail_nbuckets to record the count of
available buckets, and we calculate bucket_in_use when alloc or free
bucket in real time.

[edited by ML: eliminated some whitespace errors]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ac6424b9 19-Jun-2017 Ingo Molnar <mingo@kernel.org>

sched/wait: Rename wait_queue_t => wait_queue_entry_t

Rename:

wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# be628be0 26-Oct-2016 Kent Overstreet <kent.overstreet@gmail.com>

bcache: Make gc wakeup sane, remove set_task_state()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>


# 2452cc89 12-Jul-2014 Slava Pestov <sp@daterainc.com>

bcache: try to set b->parent properly

bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1


# c5aa4a31 21-Apr-2014 Slava Pestov <sp@daterainc.com>

bcache: wait for buckets when allocating new btree root

Tested:
- sometimes bcache_tier test would hang on startup with a failure
to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 2531d9ee 17-Mar-2014 Kent Overstreet <kmo@daterainc.com>

bcache: Kill unused freelist

This was originally added as at optimization that for various reasons isn't
needed anymore, but it does add a lot of nasty corner cases (and it was
responsible for some recently fixed bugs). Just get rid of it now.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 0a63b66d 17-Mar-2014 Kent Overstreet <kmo@daterainc.com>

bcache: Rework btree cache reserve handling

This changes the bucket allocation reserves to use _real_ reserves - separate
freelists - instead of watermarks, which if nothing else makes the current code
saner to reason about and is going to be important in the future when we add
support for multiple btrees.

It also adds btree_check_reserve(), which checks (and locks) the reserves for
both bucket allocation and memory allocation for btree nodes; the old code just
kinda sorta assumed that since (e.g. for btree node splits) it had the root
locked and that meant no other threads could try to make use of the same
reserve; this technically should have been ok for memory allocation (we should
always have a reserve for memory allocation (the btree node cache is used as a
reserve and we preallocate it)), but multiple btrees will mean that locking the
root won't be sufficient anymore, and for the bucket allocation reserve it was
technically possible for the old code to deadlock.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 2a285686 04-Mar-2014 Kent Overstreet <kmo@daterainc.com>

bcache: btree locking rework

Add a new lock, b->write_lock, which is required to actually modify - or write -
a btree node; this lock is only held for short durations.

This means we can write out a btree node without taking b->lock, which _is_ held
for long durations - solving a deadlock when btree_flush_write() (from the
journalling code) is called with a btree node locked.

Right now just occurs in bch_btree_set_root(), but with an upcoming journalling
rework is going to happen a lot more.

This also turns b->lock is now more of a read/intent lock instead of a
read/write lock - but not completely, since it still blocks readers. May turn it
into a real intent lock at some point in the future.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 487dded8 17-Mar-2014 Kent Overstreet <kmo@daterainc.com>

bcache: Fix another bug recovering from unclean shutdown

The on disk bucket gens are allowed to be out of date, when we reuse buckets
that didn't have any live data in them. To deal with this, the initial gc has to
update the bucket gen when we find a pointer gen newer than the bucket's gen.

Unfortunately we weren't doing this for pointers in the journal that we're about
to replay.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# c052dd9a 11-Nov-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert btree_iter to struct btree_keys

More work to disentangle bset.c from struct btree

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# a85e968e 20-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Add struct btree_keys

Soon, bset.c won't need to depend on struct btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 65d45231 20-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Abstract out stuff needed for sorting

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# ee811287 18-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Rename/shuffle various code around

More work to disentangle bset.c from the rest of the code:

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 911c9610 28-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Split out sort_extent_cmp()

Only use extent comparison for comparing extents, so we're not using
START_KEY() on other key types (i.e. btree pointers)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 78b77bf8 17-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Btree verify code improvements

Used this fixed code to find and fix the bug fixed by
a4d885097b0ac0cd1337f171f2d4b83e946094d4.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 88b9f8c4 17-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: kill index()

That was a terrible name for a macro, add some better helpers to replace it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 78365411 17-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Rework allocator reserves

We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.

This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# cb7a583e 16-Dec-2013 Kent Overstreet <kmo@daterainc.com>

bcache: kill closure locking usage

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# bc9389ee 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Avoid deadlocking in garbage collection

Not a complete fix - we could still deadlock if btree_insert_node() has
to split...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# a1f0358b 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Incremental gc

Big garbage collection rewrite; now, garbage collection uses the same
mechanisms as used elsewhere for inserting/updating btree node pointers,
instead of rewriting interior btree nodes in place.

This makes the code significantly cleaner and less fragile, and means we
can now make garbage collection incremental - it doesn't have to hold a
write lock on the root of the btree for the entire duration of garbage
collection.

This means that there's less of a latency hit for doing garbage
collection, which means we can gc more frequently (and do a better job
of reclaiming from the cache), and we can coalesce across more btree
nodes (improving our space efficiency).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# d5cc66e9 25-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: bch_(btree|extent)_ptr_invalid()

Trying to treat btree pointers and leaf node pointers the same way was a
mistake - going to start being more explicit about the type of
key/pointer we're dealing with. This is the first part of that
refactoring; this patch shouldn't change any actual behaviour.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 3a3b6a4e 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Don't bother with bucket refcount for btree node allocations

The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.

Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 280481d0 24-Oct-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Debug code improvements

Couple changes:
* Consolidate bch_check_keys() and bch_check_key_order(), and move the
checks that only check_key_order() could do to bch_btree_iter_next().

* Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
flip on the EDEBUG checks at runtime.

* Dropped an old not terribly useful check in rw_unlock(), and
refactored/improved a some of the other debug code.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# cc7b8819 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert bch_btree_insert() to bch_btree_map_leaf_nodes()

Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 1b207d80 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Kill op->replace

This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# b54d6934 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Kill op->cl

This isn't used for waiting asynchronously anymore - so this is a fairly
trivial refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# c18536a7 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Prune struct btree_op

Eventual goal is for struct btree_op to contain only what is necessary
for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 2c1953e2 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert bch_btree_read_async() to bch_btree_map_keys()

This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# df8e8970 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Move some stuff to btree.c

With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 48dad8ba 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Add btree_map() functions

Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.

This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction is a
big improvement w.r.t. overall code quality.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 72a44517 24-Oct-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert gc to a kthread

We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 35fcd848 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert bucket_wait to wait_queue_head_t

At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# e8e1d468 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert try_wait to wait_queue_head_t

We never waited on c->try_wait asynchronously, so just use the standard
primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 0b93207a 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Move keylist out of btree_op

Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 84f0db03 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Refactor request_write()

Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 4f3d4014 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Add explicit keylist arg to btree_insert()

Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# e7c590eb 10-Sep-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Convert btree_insert_check_key() to btree_insert_node()

This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# d6fd3b11 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Explicitly track btree node's parent

This is prep work for the reworked btree insertion code.

The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.

I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 8304ad4d 24-Jul-2013 Kent Overstreet <kmo@daterainc.com>

bcache: Remove unnecessary check in should_split()

Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# f3059a54 15-May-2013 Kent Overstreet <koverstreet@google.com>

bcache: Delete fuzz tester

This code has rotted and it hasn't been used in ages anyways.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>


# 72c27061 05-Jun-2013 Kent Overstreet <koverstreet@google.com>

bcache: Write out full stripes

Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:

* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data

* When flushing dirty data, preferentially write out full stripes first
if there are any.

Signed-off-by: Kent Overstreet <koverstreet@google.com>


# 85b1492e 14-May-2013 Kent Overstreet <koverstreet@google.com>

bcache: Rip out pkey()/pbtree()

Old gcc doesnt like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().

Signed-off-by: Kent Overstreet <koverstreet@google.com>


# 57943511 25-Apr-2013 Kent Overstreet <koverstreet@google.com>

bcache: Refactor btree io

The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.

This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.

The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.

Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.

The prio_blocked mechanism was improved a bit too, now the only counter
is in struct btree_write, we don't mess with transfering a count from
struct btree anymore.

This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)

And the btree iter btree_node_read_done() uses was converted to a real
mempool.

Signed-off-by: Kent Overstreet <koverstreet@google.com>


# cafe5635 23-Mar-2013 Kent Overstreet <koverstreet@google.com>

bcache: A block layer cache

Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.

Signed-off-by: Kent Overstreet <koverstreet@google.com>