Searched +hist:1fb7e48d (Results 1 - 3 of 3) sorted by relevance

/linux-master/fs/xfs/
xfs_symlink.h

diff 1fb7e48d Mon Aug 12 04:49:40 MDT 2013 Dave Chinner <dchinner@redhat.com> xfs: split out the remote symlink handling

The remote symlink format definition and manipulation needs to be
shared with userspace, but the in-kernel interfaces do not. Split
the remote symlink format handling out into xfs_symlink_remote.[ch]
so it can easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Makefile

diff 18a1e644 Thu Feb 22 01:43:40 MST 2024 Darrick J. Wong <djwong@kernel.org> xfs: define an in-memory btree for storing refcount bag info during repairs

Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient and performant format. Recall that the
refcount recordset regenerator computes the new recordset from browsing
the rmap records. Let's say that the rmap records are:

{agbno: 10, length: 40, ...}
{agbno: 11, length: 3, ...}
{agbno: 12, length: 20, ...}
{agbno: 15, length: 1, ...}

It is convenient to have a data structure that could quickly tell us the
refcount for an arbitrary agbno without wasting memory. An array or a
list could do that pretty easily. Lists suck because of the pointer
overhead. xfarrays are a lot more compact, but we want to minimize
sparse holes in the xfarray to constrain memory usage. Maintaining any
kind of record order isn't needed for correctness, so I created the
"rcbag", which is shorthand for an unordered list of (excerpted) reverse
mappings.

So we add the first rmap to the rcbag, and it looks like:

0: {agbno: 10, length: 40}

The refcount for agbno 10 is 1. Then we move on to block 11, so we add
the second rmap:

0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}

The refcount for agbno 11 is 2. We move on to block 12, so we add the
third:

0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}
2: {agbno: 12, length: 20}

The refcount for agbnos 12 and 13 is 3. We move on to block 14, and
remove the second rmap:

0: {agbno: 10, length: 40}
1: NULL
2: {agbno: 12, length: 20}

The refcount for agbno 14 is 2. We move on to block 15, and add the
last rmap. But we don't care where it is and we don't want to expand
the array so we put it in slot 1:

0: {agbno: 10, length: 40}
1: {agbno: 15, length: 1}
2: {agbno: 12, length: 20}

The refcount for block 15 is 3. Notice how order doesn't matter in this
list? That's why repair uses an unordered list, or "bag". The data
structure is not a set because it does not guarantee uniqueness.

That said, adding and removing specific items is now an O(n) operation
because we have no idea where that item might be in the list. Overall,
the runtime is O(n^2), which is bad.

I realized that I could easily refactor the btree code and reimplement
the refcount bag with an xfbtree. Adding and removing is now O(log2 n),
so the runtime is at least O(n log2 n), which is much faster. In the
end, the rcbag becomes a sorted list, but that's merely a detail of the
implementation. The repair code doesn't care.

(Note: That horrible xfs_db bmap_inflate command can be used to exercise
this sort of rcbag insanity by cranking up refcounts quickly.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
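
To make the walkthrough concrete, here is a minimal userspace C sketch of
the naive unordered-bag approach the message describes (before the xfbtree
rework); the rcbag_* names and the fixed-size array are illustrative, not
the kernel's actual API:

#include <stdint.h>
#include <stdio.h>

/* One excerpted reverse mapping: it covers blocks [agbno, agbno + len). */
struct rcbag_rec {
        uint32_t        agbno;
        uint32_t        len;    /* 0 marks an empty (NULL) slot */
};

#define RCBAG_MAX_RECS  16

/* Stuff a record into the first empty slot; a bag has no ordering. */
static void rcbag_add(struct rcbag_rec *bag, uint32_t agbno, uint32_t len)
{
        for (int i = 0; i < RCBAG_MAX_RECS; i++) {
                if (bag[i].len == 0) {
                        bag[i].agbno = agbno;
                        bag[i].len = len;
                        return;
                }
        }
}

/* Remove every record that ends at or before agbno -- O(n) per step. */
static void rcbag_expire(struct rcbag_rec *bag, uint32_t agbno)
{
        for (int i = 0; i < RCBAG_MAX_RECS; i++)
                if (bag[i].len && bag[i].agbno + bag[i].len <= agbno)
                        bag[i].len = 0;
}

/* The refcount of agbno is the number of live records covering it. */
static unsigned int rcbag_refcount(const struct rcbag_rec *bag)
{
        unsigned int nr = 0;

        for (int i = 0; i < RCBAG_MAX_RECS; i++)
                if (bag[i].len)
                        nr++;
        return nr;
}

int main(void)
{
        const struct rcbag_rec rmaps[] = {
                { 10, 40 }, { 11, 3 }, { 12, 20 }, { 15, 1 },
        };
        const unsigned int nr_rmaps = sizeof(rmaps) / sizeof(rmaps[0]);
        struct rcbag_rec bag[RCBAG_MAX_RECS] = { { 0, 0 } };
        unsigned int next = 0;

        /* Replay the example above, one agbno at a time. */
        for (uint32_t agbno = 10; agbno <= 15; agbno++) {
                rcbag_expire(bag, agbno);
                while (next < nr_rmaps && rmaps[next].agbno == agbno) {
                        rcbag_add(bag, rmaps[next].agbno, rmaps[next].len);
                        next++;
                }
                printf("agbno %u: refcount %u\n", (unsigned)agbno,
                       rcbag_refcount(bag));
        }
        return 0;
}

Because rcbag_add() reuses the first empty slot, this reproduces the
slot-reuse behaviour shown above, and it prints refcounts 1, 2, 3, 3, 2, 3
for blocks 10 through 15. Swapping the two linear scans for an xfbtree
keyed by block number is what turns each step into O(log2 n).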
diff 8660c7b7 Thu Feb 22 01:30:45 MST 2024 Darrick J. Wong <djwong@kernel.org> xfs: implement live inode scan for scrub

This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.

The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited. This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data. It is critical for the scanning thread to hold
exclusive access on the inode until after marking the inode visited.

This new code is a separate patch from the patchsets adding callers for
the sake of enabling the author to move patches around his tree with
ease. The intended usage model for this code is roughly:

	xchk_iscan_start(iscan, 0, 0);
	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
		xfs_ilock(ip, ...);
		/* capture inode metadata */
		xchk_iscan_mark_visited(iscan, ip);
		xfs_iunlock(ip, ...);

		xfs_irele(ip);
	}
	xchk_iscan_stop(iscan);
	if (error)
		return error;

Hook functions for live updates can then do:

	if (xchk_iscan_want_live_update(...))
		/* update the captured inode metadata */

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff be408417 Wed Dec 06 19:40:59 MST 2023 Darrick J. Wong <djwong@kernel.org> xfs: implement block reservation accounting for btrees we're staging

Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.

As for the particular choice of lowspace thresholds and btree block
slack factors -- at this point one could say that the thresholds in
online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
But that's not the entire story, since the offline btree rebuilding
code in xfs_repair was merged as a retroport of the online btree code
in this patchset!

Before xfs_btree_staging.[ch] came along, xfs_repair determined the
slack factor (aka the number of slots to leave unfilled in each new
btree block) via open-coded logic in repair/phase5.c[2]. At that point
the slack factors were arbitrary quantities per btree. The rmapbt
automatically left 10 slots free; everything else left zero.

That had a noticeable effect on performance straight after mounting
because adding records to /any/ btree would result in splits. A few
years ago when this patch was first written, Dave and I decided that
repair should generate btree blocks that were 75% full unless space was
tight, in which case it should try to fill the blocks to nearly full.
We defined tight as ~10% free to avoid repair failures but settled on
3/32 (~9%) to avoid div64.

IOWs, we mostly pulled the thresholds out of thin air. We've been
QAing with those geometry numbers ever since. ;)

Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
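
A hedged sketch of the slack heuristic described above; bulkload_slack()
and its parameters are illustrative names, not xrep_newbt's real
interface:

#include <stdint.h>

/*
 * Leave 25% of each new btree block unused so post-repair insertions
 * don't split blocks immediately, unless free space in the AG has
 * dropped below roughly 3/32 (~9%, a fraction chosen to avoid a 64-bit
 * division), in which case pack the blocks nearly full.
 */
static unsigned int bulkload_slack(uint64_t free_blocks, uint64_t ag_blocks,
                                   unsigned int recs_per_block)
{
        if (free_blocks < (ag_blocks >> 5) * 3) /* ~3/32 of the AG */
                return 0;                       /* fill nearly full */
        return recs_per_block / 4;              /* leave 75% full */
}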
xfs_symlink.c

diff 83a21c18 Tue Mar 29 00:14:00 MDT 2022 Chandan Babu R <chandan.babu@oracle.com> xfs: Directory's data fork extent counter can never overflow

The maximum file size that can be represented by the data fork extent counter
in the worst case occurs when all extents are 1 block in length and each block
is 1KB in size.

With XFS_MAX_EXTCNT_DATA_FORK_SMALL representing the maximum extent count and
with 1KB sized blocks, a file can reach up to:
(2^31) * 1KB = 2TB

This is much larger than the theoretical maximum size of a directory,
i.e. XFS_DIR2_SPACE_SIZE * 3 = ~96GB.

Since a directory's inode can never overflow its data fork extent counter,
this commit removes all the overflow checks associated with
it. xfs_dinode_verify() now performs a rough check to verify if a directory's
data fork is larger than 96GB.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
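
As a sanity check on the arithmetic above, a small userspace program; the
32GB value per directory space is this editor's inference from the ~96GB
total, not quoted from the on-disk format headers:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        /* 2^31 worst-case single-block extents of the minimum 1KB size */
        uint64_t max_counter_bytes = (1ULL << 31) * 1024;       /* 2TB */
        /* three directory spaces of ~32GB each, per the ~96GB figure */
        uint64_t max_dir_bytes = 3 * (32ULL << 30);             /* ~96GB */

        printf("counter bound %" PRIu64 " >> dir bound %" PRIu64 "\n",
               max_counter_bytes, max_dir_bytes);
        return 0;
}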
diff 38c26bfd Wed Aug 18 19:46:37 MDT 2021 Dave Chinner <dchinner@redhat.com> xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like
this:

for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done

With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.

The result is a lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:

$ size -t fs/xfs/built-in.a
          text    data  bss      dec     hex  filename
before 1130866  311352  484  1442702  16038e  (TOTALS)
after  1127727  311352  484  1439563  15f74b  (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
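
To illustrate what the sed rule does, a hedged before/after using the
reflink feature (the if-bodies are placeholders):

/* before: open-coded check against the on-disk superblock fields */
if (xfs_sb_version_hasreflink(&mp->m_sb)) {
        /* ... reflink-only work ... */
}

/* after: check the feature bit cached in mp->m_features at mount time */
if (xfs_has_reflink(mp)) {
        /* ... reflink-only work ... */
}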
diff e6a688c3 Mon Mar 22 10:52:03 MDT 2021 Dave Chinner <dchinner@redhat.com> xfs: initialise attr fork on inode create

When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.

This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.

This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:

  18.99%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.57%  [kernel]  [k] do_raw_spin_lock
   2.51%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.48%  [kernel]  [k] memcpy
   2.34%  [kernel]  [k] xfs_log_commit_cil

The typical profile resulting from running fsmark on a selinux-enabled
filesystem adds this overhead to the create path:

- 15.30% xfs_init_security
   - 15.23% security_inode_init_security
      - 13.05% xfs_initxattrs
         - 12.94% xfs_attr_set
            - 6.75% xfs_bmap_add_attrfork
               - 5.51% xfs_trans_commit
                  - 5.48% __xfs_trans_commit
                     - 5.35% xfs_log_commit_cil
                        - 3.86% _raw_spin_lock
                           - do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
               - 0.70% xfs_trans_alloc
                    0.52% xfs_trans_reserve
            - 5.41% xfs_attr_set_args
               - 5.39% xfs_attr_set_shortform.constprop.0
                  - 4.46% xfs_trans_commit
                     - 4.46% __xfs_trans_commit
                        - 4.33% xfs_log_commit_cil
                           - 2.74% _raw_spin_lock
                              - do_raw_spin_lock
                                   __pv_queued_spin_lock_slowpath
                           0.60% xfs_inode_item_format
                  0.90% xfs_attr_try_sf_addname
   - 1.99% selinux_inode_init_security
      - 1.02% security_sid_to_context_force
         - 1.00% security_sid_to_context_core
            - 0.92% sidtab_entry_to_string
               - 0.90% sidtab_sid2str_get
                    0.59% sidtab_sid2str_put.part.0
      - 0.82% selinux_determine_inode_label
         - 0.77% security_transition_sid
              0.70% security_compute_sid.part.0

And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.

So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:

- 8.99% xfs_init_security
   - 9.00% security_inode_init_security
      - 6.43% xfs_initxattrs
         - 6.37% xfs_attr_set
            - 5.45% xfs_attr_set_args
               - 5.42% xfs_attr_set_shortform.constprop.0
                  - 4.51% xfs_trans_commit
                     - 4.54% __xfs_trans_commit
                        - 4.59% xfs_log_commit_cil
                           - 2.67% _raw_spin_lock
                              - 3.28% do_raw_spin_lock
                                   3.08% __pv_queued_spin_lock_slowpath
                           0.66% xfs_inode_item_format
                  - 0.90% xfs_attr_try_sf_addname
            - 0.60% xfs_trans_alloc
      - 2.35% selinux_inode_init_security
         - 1.25% security_sid_to_context_force
            - 1.21% security_sid_to_context_core
               - 1.19% sidtab_entry_to_string
                  - 1.20% sidtab_sid2str_get
                     - 0.86% sidtab_sid2str_put.part.0
                        - 0.62% _raw_spin_lock_irqsave
                           - 0.77% do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
         - 0.84% selinux_determine_inode_label
            - 0.83% security_transition_sid
                 0.86% security_compute_sid.part.0

Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
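
A rough sketch of the mechanism described above (hedged: the helper names
and field layout approximate the kernel's; the real change is in the
actual commit): when the caller knows an xattr will follow immediately,
initialise the attr fork while the inode is being set up inside the
create transaction:

/* during new-inode initialisation, inside the create transaction */
if (init_xattrs && xfs_has_attr(mp)) {
        /*
         * Set the fork offset up front and give the inode an empty
         * extents-format attribute fork, so the later xfs_attr_set()
         * never needs a separate "add attr fork" transaction.
         */
        ip->i_forkoff = xfs_default_attroffset(ip) >> 3;
        /* ... allocate the in-core attr fork structure here ... */
}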
diff f5d92749 Fri Jan 22 17:48:12 MST 2021 Chandan Babu R <chandanrlinux@gmail.com> xfs: Check for extent overflow when adding dir entries

Directory entry addition can cause the following:
1. A data block can be added/removed.
   A new extent can cause the extent count to increase by 1.
2. A free disk block can be added/removed.
   Same behaviour as described above for the data block.
3. Dabtree blocks.
   XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
   can be new extents. Hence the extent count can increase by
   XFS_DA_NODE_MAXDEPTH.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
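
Summing that list gives a small worst-case bound on new extents; a hedged
sketch (the commit itself encodes this as the XFS_IEXT_DIR_MANIP_CNT
macro, scaled by the directory block size in filesystem blocks):

#define XFS_DA_NODE_MAXDEPTH    5       /* dabtree depth limit, per the
                                           on-disk format headers */

/*
 * Worst case: 1 new data block + 1 new free block + a full dabtree
 * path, each potentially mapped by its own new extent.
 */
static unsigned int dir_add_max_new_extents(unsigned int dir_fsbcount)
{
        return (XFS_DA_NODE_MAXDEPTH + 1 + 1) * dir_fsbcount;
}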
diff 3565b660 Wed May 09 08:49:09 MDT 2018 Dave Chinner <dchinner@redhat.com> xfs: fix double ijoin in xfs_inactive_symlink_rmt()

xfs_inactive_symlink_rmt() does something nasty - it joins an inode
into a transaction it is already joined to. This means the inode can
have multiple log item descriptors attached to the transaction for
it. This breaks the 1:1 mapping that is supposed to exist
between the log item and log item descriptor.

This results in the log item being processed twice during
transaction commit and CIL formatting, and there are lots of other
potential issues that arise from double processing of log items in
the transaction commit state machine.

In this case, the inode is already held by the rolling transaction
returned from xfs_defer_finish(), so there's no need to join it
again.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
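
The shape of the bug and the fix, sketched in hedged pseudo-kernel C
(signatures are approximate; xfs_defer_finish() in particular has changed
across releases):

error = xfs_defer_finish(&tp);  /* rolls tp; ip stays attached to it */
if (error)
        goto out;
/*
 * BUG: ip is already joined to the rolled transaction, so a second
 * join attaches a duplicate log item descriptor to the same log item.
 * The fix is simply to delete this call.
 */
xfs_trans_ijoin(tp, ip, 0);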
