History log of /linux-master/fs/btrfs/free-space-cache.h
Revision Date Author Comments
# 602035d7 26-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: add forward declarations and headers, part 2

Do a cleanup in more headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>


# 737e6e5f 05-Feb-2024 Lijuan Li <lilijuan@iscas.ac.cn>

btrfs: mark __btrfs_add_free_space static

__btrfs_add_free_space is only used in free-space-cache.c,
so mark it static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 54d687c1 29-Apr-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c

This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb9a10a6 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: convert discard stat defs to enum

Do away with the defines and use an enum as it's cleaner.

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eda517fd 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move free space cachep's out of ctree.h

This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 390d89cc 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move discard stat defs to free-space-cache.h

These definitions are used for discard statistics, move them out of
ctree.h and put them in free-space-cache.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc80f7ac 08-Aug-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove use btrfs_remove_free_space_cache instead of variant

We are calling __btrfs_remove_free_space_cache everywhere to cleanup the
block group free space, however we can just use
btrfs_remove_free_space_cache and pass in the block group in all of
these places. Then we can remove __btrfs_remove_free_space_cache and
rename __btrfs_remove_free_space_cache_locked to
__btrfs_remove_free_space_cache.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 364be842 23-Nov-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: change name and type of private member of btrfs_free_space_ctl

btrfs_free_space_ctl::private is either unset or it always points to
struct btrfs_block_group when it is set. So there's no point in keeping
the unhelpful 'private' name and keeping it an untyped pointer. Change
both the type and name to be self-describing. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 290ef19a 23-Nov-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: make __btrfs_add_free_space take just block group reference

There is no point in the function taking an fs_info and a
btrfs_free_space because the ctl passed always belongs to the block
group. Furthermore fs_info can be referenced from the block group. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 59c7b566 18-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: index free space entries on size

Currently we index free space on offset only, because usually we have a
hint from the allocator that we want to honor for locality reasons.
However if we fail to use this hint we have to go back to a brute force
search through the free space entries to find a large enough extent.

With sufficiently fragmented free space this becomes quite expensive, as
we have to linearly search all of the free space entries to find if we
have a part that's long enough.

To fix this add a cached rb tree to index based on free space entry
bytes. This will allow us to quickly look up the largest chunk in the
free space tree for this block group, and stop searching once we've
found an entry that is too small to satisfy our allocation. We simply
choose to use this tree if we're searching from the beginning of the
block group, as we know we do not care about locality at that point.

I wrote an allocator test that creates a 10TiB ram backed null block
device and then fallocates random files until the file system is full.
I think go through and delete all of the odd files. Then I spawn 8
threads that fallocate 64MiB files (1/2 our extent size cap) until the
file system is full again. I use bcc's funclatency to measure the
latency of find_free_extent. The baseline results are

nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 10356 |**** |
512 -> 1023 : 58242 |************************* |
1024 -> 2047 : 74418 |******************************** |
2048 -> 4095 : 90393 |****************************************|
4096 -> 8191 : 79119 |*********************************** |
8192 -> 16383 : 35614 |*************** |
16384 -> 32767 : 13418 |***** |
32768 -> 65535 : 12811 |***** |
65536 -> 131071 : 17090 |******* |
131072 -> 262143 : 26465 |*********** |
262144 -> 524287 : 40179 |***************** |
524288 -> 1048575 : 55469 |************************ |
1048576 -> 2097151 : 48807 |********************* |
2097152 -> 4194303 : 26744 |*********** |
4194304 -> 8388607 : 35351 |*************** |
8388608 -> 16777215 : 13918 |****** |
16777216 -> 33554431 : 21 | |

avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690

And the patch results are

nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 6883 |** |
512 -> 1023 : 54346 |********************* |
1024 -> 2047 : 79170 |******************************** |
2048 -> 4095 : 98890 |****************************************|
4096 -> 8191 : 81911 |********************************* |
8192 -> 16383 : 27075 |********** |
16384 -> 32767 : 14668 |***** |
32768 -> 65535 : 13251 |***** |
65536 -> 131071 : 15340 |****** |
131072 -> 262143 : 26715 |********** |
262144 -> 524287 : 43274 |***************** |
524288 -> 1048575 : 53870 |********************* |
1048576 -> 2097151 : 55368 |********************** |
2097152 -> 4194303 : 41036 |**************** |
4194304 -> 8388607 : 24927 |********** |
8388608 -> 16777215 : 33 | |
16777216 -> 33554431 : 9 | |

avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042

There's a little variation in the amount of calls done because of timing
of the threads with metadata requirements, but the avg, total, and
count's are relatively consistent between runs (usually within 2-5% of
each other). As you can see here we have around a 30% decrease in
average latency with a 30% decrease in overall time spent in
find_free_extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 169e0da9 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: track unusable bytes for zones

In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.

Pinned bytes are always reclaimed to the unusable space. But, when an
allocated region is returned before using e.g., the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For the situation, this commit
introduces "btrfs_add_free_space_unused". This behaves the same as
btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
rewinds the allocation offset.

Because the read-only bytes tracks free but unusable bytes when the block
group is read-only, we need to migrate the zone_unusable bytes to
read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 36b216c8 18-Nov-2020 Boris Burkov <boris@bur.io>

btrfs: remove free space items when disabling space cache v1

When the filesystem transitions from space cache v1 to v2 or to
nospace_cache, it removes the old cached data, but does not remove
the FREE_SPACE items nor the free space inodes they point to. This
doesn't cause any issues besides being a bit inefficient, since these
items no longer do anything useful.

To fix it, when we are mounting, and plan to disable the space cache,
destroy each block group's free space item and free space inode.
The code to remove the items is lifted from the existing use case of
removing the block group, with a light adaptation to handle whether or
not we have already looked up the free space inode.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94846229 18-Nov-2020 Boris Burkov <boris@bur.io>

btrfs: keep sb cache_generation consistent with space_cache

When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.

In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.

We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).

Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# fa598b06 03-Dec-2020 David Sterba <dsterba@suse.com>

btrfs: remove recalc_thresholds from free space ops

After removing the inode number cache that was using the free space
cache code, we can remove at least the recalc_thresholds callback from
the ops. Both code and tests use the same callback function. It's moved
before its first use.

The use_bitmaps callback is still needed by tests to create some
extents/bitmap setup.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7dbdb443 03-Dec-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove crc_check logic from free space

Following removal of the ino cache io_ctl_init will be called only on
behalf of the freespace inode. In this case we always want to check
CRCs so conditional code that depended on io_ctl::check_crc can be
removed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5297199a 26-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove inode number cache feature

It's been deprecated since commit b547a88ea577 ("btrfs: start
deprecation of mount option inode_cache") which enumerates the reasons.

A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps, that data stay on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the change can be implemented
in btrfs-progs if needed.

[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# cd79909b 23-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: load free space cache into a temporary ctl

The free space cache has been special in that we would load it right
away instead of farming the work off to a worker thread. This resulted
in some weirdness that had to be taken into account for this fact,
namely that if we every found a block group being cached the fast way we
had to wait for it to finish, because we could get the cache before it
had been validated and we may throw the cache away.

To handle this particular case instead create a temporary
btrfs_free_space_ctl to load the free space cache into. Then once we've
validated that it makes sense, copy it's contents into the actual
block_group->free_space_ctl. This allows us to avoid the problems of
needing to wait for the caching to complete, we can clean up the discard
extent handling stuff in __load_free_space_cache, and we no longer need
to do the merge_space_tree() because the space is added one by one into
the real free_space_ctl. This will allow further reworks of how we
handle loading the free space cache.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 69b0e093 03-Jun-2020 Anand Jain <anand.jain@oracle.com>

btrfs: let btrfs_return_cluster_to_free_space() return void

__btrfs_return_cluster_to_free_space() returns only 0. And all its
parent functions don't need the return value either so make this a void
function.

Further, as none of the callers of btrfs_return_cluster_to_free_space()
is actually using the return from this function, make this function also
return void.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7fe6d45e 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: have multiple discard lists

Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5dc7c10b 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: keep track of discardable_bytes for async discard

Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# dfb79ddb 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: track discardable extents for async discard

The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.

The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 2bee7eb8 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: discard one region at a time in async discard

The prior two patches added discarding via a background workqueue. This
just piggybacked off of the fstrim code to trim the whole block at once.
Well inevitably this is worse performance wise and will aggressively
overtrim. But it was nice to plumb the other infrastructure to keep the
patches easier to review.

This adds the real goal of this series which is discarding slowly (ie. a
slow long running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is two fold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first will let the newly trimmed bitmaps have the highest chance
of coalescing when being readded to the free space cache.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6e80d4f8 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: handle empty block_group removal for async discard

block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.

Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.

The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.

The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0643e59 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: add the beginning of async discard, discard workqueue

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701

discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da080fe1 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: keep track of free space bitmap trim status cleanliness

There is a cap in btrfs in the amount of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of if a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap.

With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off with simplicity, we keep track of if discarding a bitmap
is in progress. If we fully scan a bitmap and trim as necessary, the
bitmap is marked clean. This has some caveats as the min block size may
skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.

BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked
BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
of it and the former is still set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a7ccb255 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: keep track of which extents have been discarded

Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().

An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff to for greater coalescing which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.

When reading in the free space cache from disk, if sync is set, mark all
extents as trimmed. The current code ensures at transaction commit that
all free space is trimmed when sync is set, so this reflects that.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32da5386 29-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 67b61aef 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move struct io_ctl to free-space-cache.h

The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 478b4d9f 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_add_free_space out of a header file

This is prep work for moving block_group_cache around. Having this in
the header file makes the header file include need to be in a certain
order, which is awkward, so just move it into free-space-cache.c and
then we can re-arrange later.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ceeae2e 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from block group in btrfs_find_space_cluster

We can read fs_info from the block group cache structure and can drop it
from the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# bb6cb1c5 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from block group in load_free_space_cache

We can read fs_info from the block group cache structure and can drop it
from the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7949f339 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from block group in lookup_free_space_inode

We can read fs_info from the block group cache structure and can drop it
from the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# fe041534 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from trans in btrfs_write_out_cache

We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 4ca75f1b 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from trans in create_free_space_inode

We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9888c340 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- headers

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>


# 77ab86bf 15-Feb-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: free-space-cache, clean up unnecessary root arguments

The free space cache APIs accept a root but always use the tree root.

Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afdb5718 08-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: simplify btrfs_wait_cache_io prototype

With the exception of the one case where btrfs_wait_cache_io is called
without a block group, it's called with the same arguments. The root
argument is only used in the special case, so let's factor out the core
and simplify the call in the normal case to require a trans, block group,
and path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5b4aacef 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: call functions that overwrite their root parameter with fs_info

There are 11 functions that accept a root parameter and immediately
overwrite it. We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab8d0fc4 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert pr_* to btrfs_* where possible

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 01327610 19-May-2016 Nicholas D Steeves <nsteeves@gmail.com>

btrfs: fix string and comment grammatical issues and typos

Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 20e5506b 19-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: constify remaining structs with function pointers

* struct extent_io_ops
* struct btrfs_free_space_op

Signed-off-by: David Sterba <dsterba@suse.com>


# cef40483 02-Oct-2015 Josef Bacik <jbacik@fb.com>

Btrfs: keep track of largest extent in bitmaps

We can waste a lot of time searching through bitmaps when we are heavily
fragmented trying to find large contiguous areas that don't exist in the bitmap.
So keep track of the max extent size when we do a full search of a bitmap so
that next time around we can just skip the expensive searching if our max size
is less than what we are looking for. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1bbc621e 06-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: allow block group cache writeout outside critical section in commit

We loop through all of the dirty block groups during commit and write
the free space cache. In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit. We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>


# c9dc4c65 04-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: two stage dirty block group writeout

Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one. This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason <clm@fb.com>


# 55507ce3 01-Dec-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between writing free space cache and trimming

Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:

*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:

[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>] [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017] [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017] [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017] [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017] [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017] [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017] [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017] [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017] [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017] [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017] RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 74514323 20-Sep-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: remove path arg from btrfs_truncate_free_space_cache

Not used for anything, and removing it avoids caller's need to
allocate a path structure.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 53645a91 20-Sep-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: remove duplicated ino cache's inode lookup

We're doing a unnecessary extra lookup of the ino cache's
inode when we already have it (and holding a reference)
during the process of saving the ino cache contents to disk.
Therefore remove this extra lookup.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# a4820398 08-Sep-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: allocate the free space by the existed max extent size when ENOSPC

By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.

According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync

Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
less than we need.

Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
the free space in a bitmap.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# dc11dd5d 14-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: separate out tests into their own directory

The plan is to have a bunch of unit tests that run when btrfs is loaded when you
build with the appropriate config option. My ultimate goal is to have a test
for every non-static function we have, but at first I'm going to focus on the
things that cause us the most problems. To start out with this just adds a
tests/ directory and moves the existing free space cache tests into that
directory and sets up all of the infrastructure. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 00361589 14-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: avoid starting a transaction in the write path

I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# e6d29605 30-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: move ifdef around sanity checks out of init_btrfs_fs

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 7b61cd92 13-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't use global block reservation for inode cache truncation

It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 74255aa0 15-Mar-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add some free space cache tests

We keep hitting bugs in the tree log replay because btrfs_remove_free_space
doesn't account for some corner case. So add a bunch of tests to try and fully
test btrfs_remove_free_space since the only time it is called is during tree log
replay. These tests all finish successfully, so as we find more of these bugs
we need to add to these tests to make sure we don't regress in fixing things.
I've hidden the tests behind a Kconfig option, but they take no time to run so
all btrfs developers should have this turned on all the time. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# f2a97a9d 04-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove all unused functions

Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>


# 82d5902d 19-Apr-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Support reading/writing on disk free ino cache

This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 581bb050 19-Apr-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Cache free inode numbers in memory

Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 34d52cb6 28-Mar-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Make free space cache code generic

So we can re-use the code to cache free inode numbers.

The change is quite straightforward. Two new structures are introduced.

- struct btrfs_free_space_ctl

We move those variables that are used for caching free space from
struct btrfs_block_group_cache to this new struct.

- struct btrfs_free_space_op

We do block group specific work (e.g. calculation of extents threshold)
through functions registered in this struct.

And then we can remove references to struct btrfs_block_group_cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 92c42311 02-Mar-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Remove unused btrfs_block_group_free_space()

We've already recorded the value in block_group->frees_space.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# f7039b1d 24-Mar-2011 Li Dongyang <lidongyang@novell.com>

Btrfs: add btrfs_trim_fs() to handle FITRIM

We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9d66e233 25-Aug-2010 Josef Bacik <josef@redhat.com>

Btrfs: load free space cache if it exists

This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0cb59c99 01-Jul-2010 Josef Bacik <josef@redhat.com>

Btrfs: write out free space cache

This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0af3d00b 21-Jun-2010 Josef Bacik <josef@redhat.com>

Btrfs: create special free space cache inode

In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 96303081 13-Jul-2009 Josef Bacik <josef@redhat.com>

Btrfs: use hybrid extents+bitmap rb tree for free space

Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 451d7585 09-Jun-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: add mount -o ssd_spread to spread allocations out

Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in. It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fa9c0d79 03-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: rework allocation clustering

Because btrfs is copy-on-write, we end up picking new locations for
blocks very often. This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.

This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation. The end result is that we tend to write sequentially to
the drive.

This happens all the time for metadata and it happens for data
when mounted -o ssd. But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.

This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.

The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents. This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.

Signed-off-by: Chris Mason <chris.mason@oracle.com>