Searched +hist:4 +hist:fb6e8ad (Results 1 - 25 of 76) sorted by relevance

1234

/linux-master/fs/xfs/libxfs/
H A Dxfs_log_rlimit.cdiff 4ecf9e7c Mon Apr 25 19:38:14 MDT 2022 Darrick J. Wong <djwong@kernel.org> xfs: reduce the absurdly large log operation count

Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.

Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dquot_buf.cdiff 4ea1ff3b Mon Aug 17 10:59:51 MDT 2020 Darrick J. Wong <darrick.wong@oracle.com> xfs: widen ondisk quota expiration timestamps to handle y2038+

Enable the bigtime feature for quota timers. We decrease the accuracy
of the timers to ~4s in exchange for being able to set timers up to the
bigtime maximum.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4ea1ff3b Mon Aug 17 10:59:51 MDT 2020 Darrick J. Wong <darrick.wong@oracle.com> xfs: widen ondisk quota expiration timestamps to handle y2038+

Enable the bigtime feature for quota timers. We decrease the accuracy
of the timers to ~4s in exchange for being able to set timers up to the
bigtime maximum.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_ialloc.hdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_symlink_remote.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_trans_resv.cdiff 4ecf9e7c Mon Apr 25 19:38:14 MDT 2022 Darrick J. Wong <djwong@kernel.org> xfs: reduce the absurdly large log operation count

Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.

Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff b1de6fc7 Wed Dec 11 14:19:07 MST 2019 Darrick J. Wong <darrick.wong@oracle.com> xfs: fix log reservation overflows when allocating large rt extents

Omar Sandoval reported that a 4G fallocate on the realtime device causes
filesystem shutdowns due to a log reservation overflow that happens when
we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the
the tr_write and tr_itruncate log reservation calculation.

"The following reproducer results in a transaction log overrun warning
for me:

mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
mount -o rtdev=/dev/vdc /dev/vdb /mnt
fallocate -l 4G /mnt/foo

Reported-by: Omar Sandoval <osandov@osandov.com>
Tested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
diff b1de6fc7 Wed Dec 11 14:19:07 MST 2019 Darrick J. Wong <darrick.wong@oracle.com> xfs: fix log reservation overflows when allocating large rt extents

Omar Sandoval reported that a 4G fallocate on the realtime device causes
filesystem shutdowns due to a log reservation overflow that happens when
we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the
the tr_write and tr_itruncate log reservation calculation.

"The following reproducer results in a transaction log overrun warning
for me:

mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
mount -o rtdev=/dev/vdc /dev/vdb /mnt
fallocate -l 4G /mnt/foo

Reported-by: Omar Sandoval <osandov@osandov.com>
Tested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dir2_sf.cdiff 3f8a4f1d Thu Oct 17 14:40:33 MDT 2019 Dave Chinner <dchinner@redhat.com> xfs: fix inode fork extent count overflow

[commit message is verbose for discussion purposes - will trim it
down later. Some questions about implementation details at the end.]

Zorro Lang recently ran a new test to stress single inode extent
counts now that they are no longer limited by memory allocation.
The test was simply:

# xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
# ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file

This test uncovered a problem where the hole punching operation
appeared to finish with no error, but apparently only created 268M
extents instead of the 10 billion it was supposed to.

Further, trying to punch out extents that should have been present
resulted in success, but no change in the extent count. It looked
like a silent failure.

While running the test and observing the behaviour in real time,
I observed the extent coutn growing at ~2M extents/minute, and saw
this after about an hour:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
> sleep 60 ; \
> xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 127657993
fsxattr.nextents = 129683339
#

And a few minutes later this:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4177861124
#

Ah, what? Where did that 4 billion extra extents suddenly come from?

Stop the workload, unmount, mount:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 166044375
#

And it's back at the expected number. i.e. the extent count is
correct on disk, but it's screwed up in memory. I loaded up the
extent list, and immediately:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4192576215
#

It's bad again. So, where does that number come from?
xfs_fill_fsxattr():

if (ip->i_df.if_flags & XFS_IFEXTENTS)
fa->fsx_nextents = xfs_iext_count(&ip->i_df);
else
fa->fsx_nextents = ip->i_d.di_nextents;

And that's the behaviour I just saw in a nutshell. The on disk count
is correct, but once the tree is loaded into memory, it goes whacky.
Clearly there's something wrong with xfs_iext_count():

inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
{
return ifp->if_bytes / sizeof(struct xfs_iext_rec);
}

Simple enough, but 134M extents is 2**27, and that's right about
where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
which means 2**27 * 2**4 = 2**31 and we're right on target for an
integer overflow. And, sure enough:

struct xfs_ifork {
int if_bytes; /* bytes in if_u1 */
....

Once we get 2**27 extents in a file, we overflow if_bytes and the
in-core extent count goes wrong. And when we reach 2**28 extents,
if_bytes wraps back to zero and things really start to go wrong
there. This is where the silent failure comes from - only the first
2**28 extents can be looked up directly due to the overflow, all the
extents above this index wrap back to somewhere in the first 2**28
extents. Hence with a regular pattern, trying to punch a hole in the
range that didn't have holes mapped to a hole in the first 2**28
extents and so "succeeded" without changing anything. Hence "silent
failure"...

Fix this by converting if_bytes to a int64_t and converting all the
index variables and size calculations to use int64_t types to avoid
overflows in future. Signed integers are still used to enable easy
detection of extent count underflows. This enables scalability of
extent counts to the limits of the on-disk format - MAXEXTNUM
(2**31) extents.

Current testing is at over 500M extents and still going:

fsxattr.nextents = 517310478

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 3f8a4f1d Thu Oct 17 14:40:33 MDT 2019 Dave Chinner <dchinner@redhat.com> xfs: fix inode fork extent count overflow

[commit message is verbose for discussion purposes - will trim it
down later. Some questions about implementation details at the end.]

Zorro Lang recently ran a new test to stress single inode extent
counts now that they are no longer limited by memory allocation.
The test was simply:

# xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
# ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file

This test uncovered a problem where the hole punching operation
appeared to finish with no error, but apparently only created 268M
extents instead of the 10 billion it was supposed to.

Further, trying to punch out extents that should have been present
resulted in success, but no change in the extent count. It looked
like a silent failure.

While running the test and observing the behaviour in real time,
I observed the extent coutn growing at ~2M extents/minute, and saw
this after about an hour:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
> sleep 60 ; \
> xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 127657993
fsxattr.nextents = 129683339
#

And a few minutes later this:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4177861124
#

Ah, what? Where did that 4 billion extra extents suddenly come from?

Stop the workload, unmount, mount:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 166044375
#

And it's back at the expected number. i.e. the extent count is
correct on disk, but it's screwed up in memory. I loaded up the
extent list, and immediately:

# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4192576215
#

It's bad again. So, where does that number come from?
xfs_fill_fsxattr():

if (ip->i_df.if_flags & XFS_IFEXTENTS)
fa->fsx_nextents = xfs_iext_count(&ip->i_df);
else
fa->fsx_nextents = ip->i_d.di_nextents;

And that's the behaviour I just saw in a nutshell. The on disk count
is correct, but once the tree is loaded into memory, it goes whacky.
Clearly there's something wrong with xfs_iext_count():

inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
{
return ifp->if_bytes / sizeof(struct xfs_iext_rec);
}

Simple enough, but 134M extents is 2**27, and that's right about
where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
which means 2**27 * 2**4 = 2**31 and we're right on target for an
integer overflow. And, sure enough:

struct xfs_ifork {
int if_bytes; /* bytes in if_u1 */
....

Once we get 2**27 extents in a file, we overflow if_bytes and the
in-core extent count goes wrong. And when we reach 2**28 extents,
if_bytes wraps back to zero and things really start to go wrong
there. This is where the silent failure comes from - only the first
2**28 extents can be looked up directly due to the overflow, all the
extents above this index wrap back to somewhere in the first 2**28
extents. Hence with a regular pattern, trying to punch a hole in the
range that didn't have holes mapped to a hole in the first 2**28
extents and so "succeeded" without changing anything. Hence "silent
failure"...

Fix this by converting if_bytes to a int64_t and converting all the
index variables and size calculations to use int64_t types to avoid
overflows in future. Signed integers are still used to enable easy
detection of extent count underflows. This enables scalability of
extent counts to the limits of the on-disk format - MAXEXTNUM
(2**31) extents.

Current testing is at over 500M extents and still going:

fsxattr.nextents = 517310478

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 266b6969 Tue Jul 19 19:48:31 MDT 2016 Christoph Hellwig <hch@lst.de> xfs: kill xfs_dir2_inou_t

And use an array of unsigned char values directly to avoid problems
with architectures that pad the size of structures. This also gets
rid of the xfs_dir2_ino4_t and xfs_dir2_ino8_t types, and introduces
new constants for the size of 4 and 8 bytes as well as the size
difference between the two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_ag.hdiff 4bdfd7d1 Fri Dec 15 11:03:32 MST 2023 Darrick J. Wong <djwong@kernel.org> xfs: repair free space btrees

Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dir2_block.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dir2_data.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_attr_remote.cdiff 4ed8e27b Thu Jan 23 18:01:16 MST 2020 Darrick J. Wong <darrick.wong@oracle.com> xfs: make xfs_buf_read_map return an error code

Convert xfs_buf_read_map() to return numeric error codes like most
everywhere else in xfs. This involves moving the open-coded logic that
reports metadata IO read / corruption errors and stales the buffer into
xfs_buf_read_map so that the logic is all in one place.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff e8db2aaf Tue Jan 07 17:11:45 MST 2020 Darrick J. Wong <darrick.wong@oracle.com> xfs: fix memory corruption during remote attr value buffer invalidation

While running generic/103, I observed what looks like memory corruption
and (with slub debugging turned on) a slub redzone warning on i386 when
inactivating an inode with a 64k remote attr value.

On a v5 filesystem, maximally sized remote attr values require one block
more than 64k worth of space to hold both the remote attribute value
header (64 bytes). On a 4k block filesystem this results in a 68k
buffer; on a 64k block filesystem, this would be a 128k buffer. Note
that even though we'll never use more than 65,600 bytes of this buffer,
XFS_MAX_BLOCKSIZE is 64k.

This is a problem because the definition of struct xfs_buf_log_format
allows for XFS_MAX_BLOCKSIZE worth of dirty bitmap (64k). On i386 when we
invalidate a remote attribute, xfs_trans_binval zeroes all 68k worth of
the dirty map, writing right off the end of the log item and corrupting
memory. We've gotten away with this on x86_64 for years because the
compiler inserts a u32 padding on the end of struct xfs_buf_log_format.

Fortunately for us, remote attribute values are written to disk with
xfs_bwrite(), which is to say that they are not logged. Fix the problem
by removing all places where we could end up creating a buffer log item
for a remote attribute value and leave a note explaining why. Next,
replace the open-coded buffer invalidation with a call to the helper we
created in the previous patch that does better checking for bad metadata
before marking the buffer stale.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff e3c32ee9 Tue Jul 28 19:48:01 MDT 2015 Dave Chinner <dchinner@redhat.com> xfs: remote attribute headers contain an invalid LSN

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_rtbitmap.cdiff a6a38f30 Fri Dec 01 10:17:40 MST 2023 Darrick J. Wong <djwong@kernel.org> xfs: make rextslog computation consistent with mkfs

There's a weird discrepancy in xfsprogs dating back to the creation of
the Linux port -- if there are zero rt extents, mkfs will set
sb_rextents and sb_rextslog both to zero:

sbp->sb_rextslog =
(uint8_t)(rtextents ?
libxfs_highbit32((unsigned int)rtextents) : 0);

However, that's not the check that xfs_repair uses for nonzero rtblocks:

if (sb->sb_rextslog !=
libxfs_highbit32((unsigned int)sb->sb_rextents))

The difference here is that xfs_highbit32 returns -1 if its argument is
zero. Unfortunately, this means that in the weird corner case of a
realtime volume shorter than 1 rt extent, xfs_repair will immediately
flag a freshly formatted filesystem as corrupt. Because mkfs has been
writing ondisk artifacts like this for decades, we have to accept that
as "correct". TBH, zero rextslog for zero rtextents makes more sense to
me anyway.

Regrettably, the superblock verifier checks created in commit copied
xfs_repair even though mkfs has been writing out such filesystems for
ages. Fix the superblock verifier to accept what mkfs spits out; the
userspace version of this patch will have to fix xfs_repair as well.

Note that the new helper leaves the zeroday bug where the upper 32 bits
of sb_rextents is ripped off and fed to highbit32. This leads to a
seriously undersized rt summary file, which immediately breaks mkfs:

$ hugedisk.sh foo /dev/sdc $(( 0x100000080 * 4096))B
$ /sbin/mkfs.xfs -f /dev/sda -m rmapbt=0,reflink=0 -r rtdev=/dev/mapper/foo
meta-data=/dev/sda isize=512 agcount=4, agsize=1298176 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=5192704, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =/dev/mapper/foo extsz=4096 blocks=4294967424, rtextents=4294967424
Discarding blocks...Done.
mkfs.xfs: Error initializing the realtime space [117 - Structure needs cleaning]

The next patch will drop support for rt volumes with fewer than 1 or
more than 2^32-1 rt extents, since they've clearly been broken forever.

Fixes: f8e566c0f5e1f ("xfs: validate the realtime geometry in xfs_validate_sb_common")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff e23aaf45 Mon Oct 16 11:41:55 MDT 2023 Omar Sandoval <osandov@fb.com> xfs: invert the realtime summary cache

In commit 355e3532132b ("xfs: cache minimum realtime summary level"), I
added a cache of the minimum level of the realtime summary that has any
free extents. However, it turns out that the _maximum_ level is more
useful for upcoming optimizations, and basically equivalent for the
existing usage. So, let's change the meaning of the cache to be the
maximum level + 1, or 0 if there are no free extents.

For example, if the cache contains:

{0, 4}

then there are no free extents starting in realtime bitmap block 0, and
there are no free extents larger than or equal to 2^4 blocks starting in
realtime bitmap block 1. The cache is a loose upper bound, so there may
or may not be free extents smaller than 2^4 blocks in realtime bitmap
block 1.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff e23aaf45 Mon Oct 16 11:41:55 MDT 2023 Omar Sandoval <osandov@fb.com> xfs: invert the realtime summary cache

In commit 355e3532132b ("xfs: cache minimum realtime summary level"), I
added a cache of the minimum level of the realtime summary that has any
free extents. However, it turns out that the _maximum_ level is more
useful for upcoming optimizations, and basically equivalent for the
existing usage. So, let's change the meaning of the cache to be the
maximum level + 1, or 0 if there are no free extents.

For example, if the cache contains:

{0, 4}

then there are no free extents starting in realtime bitmap block 0, and
there are no free extents larger than or equal to 2^4 blocks starting in
realtime bitmap block 1. The cache is a loose upper bound, so there may
or may not be free extents smaller than 2^4 blocks in realtime bitmap
block 1.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff e23aaf45 Mon Oct 16 11:41:55 MDT 2023 Omar Sandoval <osandov@fb.com> xfs: invert the realtime summary cache

In commit 355e3532132b ("xfs: cache minimum realtime summary level"), I
added a cache of the minimum level of the realtime summary that has any
free extents. However, it turns out that the _maximum_ level is more
useful for upcoming optimizations, and basically equivalent for the
existing usage. So, let's change the meaning of the cache to be the
maximum level + 1, or 0 if there are no free extents.

For example, if the cache contains:

{0, 4}

then there are no free extents starting in realtime bitmap block 0, and
there are no free extents larger than or equal to 2^4 blocks starting in
realtime bitmap block 1. The cache is a loose upper bound, so there may
or may not be free extents smaller than 2^4 blocks in realtime bitmap
block 1.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff d88850bd Tue Oct 13 09:46:27 MDT 2020 Darrick J. Wong <darrick.wong@oracle.com> xfs: fix high key handling in the rt allocator's query_range function

Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
in the realtime bitmap is always one less than the number of rt extents,
which means that the key clamp at the start of the function is wrong.
The 4th argument to xfs_rtfind_forw is the highest rt extent that we
want to probe, which means that passing 1 less than the high key is
wrong. Finally, drop the rem variable that controls the loop because we
can compare the iteration point (rtstart) against the high key directly.

The sordid history of this function is that the original commit (fb3c3)
incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
to xfs_rtfind_forw. This was wrong because the "high key" is supposed
to be the largest key for which the caller wants result rows, not the
key for the first row that could possibly be outside the range that the
caller wants to see.

A subsequent attempt (8ad56) to strengthen the parameter checking added
incorrect clamping of the parameters to the number of rt blocks in the
system (despite the bitmap functions all taking units of rt extents) to
avoid querying ranges past the end of rt bitmap file but failed to fix
the incorrect _rtfind_forw parameter. The original _rtfind_forw
parameter error then survived the conversion of the startblock and
blockcount fields to rt extents (a0e5c), and the most recent off-by-one
fix (a3a37) thought it was patching a problem when the end of the rt
volume is not in use, but none of these fixes actually solved the
original problem that the author was confused about the "limit" argument
to xfs_rtfind_forw.

Sadly, all four of these patches were written by this author and even
his own usage of this function and rt testing were inadequate to get
this fixed quickly.

Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units")
Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dir2_leaf.cdiff 13cf24e0 Tue Oct 18 15:32:35 MDT 2022 Guo Xuenan <guoxuenan@huawei.com> xfs: fix exception caused by unexpected illegal bestcount in leaf dir

For leaf dir, In most cases, there should be as many bestfree slots
as the dir data blocks that can fit under i_size (except for [1]).

Root cause is we don't examin the number bestfree slots, when the slots
number less than dir data blocks, if we need to allocate new dir data
block and update the bestfree array, we will use the dir block number as
index to assign bestfree array, while we did not check the leaf buf
boundary which may cause UAF or other memory access problem. This issue
can also triggered with test cases xfs/473 from fstests.

According to Dave Chinner & Darrick's suggestion, adding buffer verifier
to detect this abnormal situation in time.
Simplify the testcase for fstest xfs/554 [1]

The error log is shown as follows:
==================================================================
BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0
Write of size 2 at addr ffff88810168b000 by task touch/1552
CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report.cold+0xf6/0x691
kasan_report+0xa8/0x120
xfs_dir2_leaf_addname+0x1995/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000
</TASK>

The buggy address belongs to the physical page:
page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x10168b
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000
raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78
XDD3[S5........x
XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file
fs/xfs/libxfs/xfs_dir2_data.c. Caller
xfs_dir2_data_use_free+0x28a/0xeb0
CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
xfs_corruption_error+0x132/0x150
xfs_dir2_data_use_free+0x198/0xeb0
xfs_dir2_leaf_addname+0xa59/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000
</TASK>
XFS (sdb): Corruption detected. Unmount and run xfs_repair

[1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 13cf24e0 Tue Oct 18 15:32:35 MDT 2022 Guo Xuenan <guoxuenan@huawei.com> xfs: fix exception caused by unexpected illegal bestcount in leaf dir

For leaf dir, In most cases, there should be as many bestfree slots
as the dir data blocks that can fit under i_size (except for [1]).

Root cause is we don't examin the number bestfree slots, when the slots
number less than dir data blocks, if we need to allocate new dir data
block and update the bestfree array, we will use the dir block number as
index to assign bestfree array, while we did not check the leaf buf
boundary which may cause UAF or other memory access problem. This issue
can also triggered with test cases xfs/473 from fstests.

According to Dave Chinner & Darrick's suggestion, adding buffer verifier
to detect this abnormal situation in time.
Simplify the testcase for fstest xfs/554 [1]

The error log is shown as follows:
==================================================================
BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0
Write of size 2 at addr ffff88810168b000 by task touch/1552
CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report.cold+0xf6/0x691
kasan_report+0xa8/0x120
xfs_dir2_leaf_addname+0x1995/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000
</TASK>

The buggy address belongs to the physical page:
page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x10168b
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000
raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78
XDD3[S5........x
XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file
fs/xfs/libxfs/xfs_dir2_data.c. Caller
xfs_dir2_data_use_free+0x28a/0xeb0
CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
xfs_corruption_error+0x132/0x150
xfs_dir2_data_use_free+0x198/0xeb0
xfs_dir2_leaf_addname+0xa59/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000
</TASK>
XFS (sdb): Corruption detected. Unmount and run xfs_repair

[1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 13cf24e0 Tue Oct 18 15:32:35 MDT 2022 Guo Xuenan <guoxuenan@huawei.com> xfs: fix exception caused by unexpected illegal bestcount in leaf dir

For leaf dir, In most cases, there should be as many bestfree slots
as the dir data blocks that can fit under i_size (except for [1]).

Root cause is we don't examin the number bestfree slots, when the slots
number less than dir data blocks, if we need to allocate new dir data
block and update the bestfree array, we will use the dir block number as
index to assign bestfree array, while we did not check the leaf buf
boundary which may cause UAF or other memory access problem. This issue
can also triggered with test cases xfs/473 from fstests.

According to Dave Chinner & Darrick's suggestion, adding buffer verifier
to detect this abnormal situation in time.
Simplify the testcase for fstest xfs/554 [1]

The error log is shown as follows:
==================================================================
BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0
Write of size 2 at addr ffff88810168b000 by task touch/1552
CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report.cold+0xf6/0x691
kasan_report+0xa8/0x120
xfs_dir2_leaf_addname+0x1995/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000
</TASK>

The buggy address belongs to the physical page:
page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x10168b
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000
raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78
XDD3[S5........x
XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file
fs/xfs/libxfs/xfs_dir2_data.c. Caller
xfs_dir2_data_use_free+0x28a/0xeb0
CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
xfs_corruption_error+0x132/0x150
xfs_dir2_data_use_free+0x198/0xeb0
xfs_dir2_leaf_addname+0xa59/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000
</TASK>
XFS (sdb): Corruption detected. Unmount and run xfs_repair

[1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 13cf24e0 Tue Oct 18 15:32:35 MDT 2022 Guo Xuenan <guoxuenan@huawei.com> xfs: fix exception caused by unexpected illegal bestcount in leaf dir

For leaf dir, In most cases, there should be as many bestfree slots
as the dir data blocks that can fit under i_size (except for [1]).

Root cause is we don't examin the number bestfree slots, when the slots
number less than dir data blocks, if we need to allocate new dir data
block and update the bestfree array, we will use the dir block number as
index to assign bestfree array, while we did not check the leaf buf
boundary which may cause UAF or other memory access problem. This issue
can also triggered with test cases xfs/473 from fstests.

According to Dave Chinner & Darrick's suggestion, adding buffer verifier
to detect this abnormal situation in time.
Simplify the testcase for fstest xfs/554 [1]

The error log is shown as follows:
==================================================================
BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0
Write of size 2 at addr ffff88810168b000 by task touch/1552
CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report.cold+0xf6/0x691
kasan_report+0xa8/0x120
xfs_dir2_leaf_addname+0x1995/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000
</TASK>

The buggy address belongs to the physical page:
page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x10168b
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000
raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78
XDD3[S5........x
XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file
fs/xfs/libxfs/xfs_dir2_data.c. Caller
xfs_dir2_data_use_free+0x28a/0xeb0
CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
xfs_corruption_error+0x132/0x150
xfs_dir2_data_use_free+0x198/0xeb0
xfs_dir2_leaf_addname+0xa59/0x1ac0
xfs_dir_createname+0x58c/0x7f0
xfs_create+0x7af/0x1010
xfs_generic_create+0x270/0x5e0
path_openat+0x270b/0x3450
do_filp_open+0x1cf/0x2b0
do_sys_openat2+0x46b/0x7a0
do_sys_open+0xb7/0x130
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fe4d9e9312b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c
RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000
</TASK>
XFS (sdb): Corruption detected. Unmount and run xfs_repair

[1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_dir2_node.cdiff 03754234 Mon Aug 24 18:05:13 MDT 2015 Jan Kara <jack@suse.com> xfs: Fix file type directory corruption for btree directories

Users have occasionally reported that file type for some directory
entries is wrong. This mostly happened after updating libraries some
libraries. After some debugging the problem was traced down to
xfs_dir2_node_replace(). The function uses args->filetype as a file type
to store in the replaced directory entry however it also calls
xfs_da3_node_lookup_int() which will store file type of the current
directory entry in args->filetype. Thus we fail to change file type of a
directory entry to a proper type.

Fix the problem by storing new file type in a local variable before
calling xfs_da3_node_lookup_int().

cc: <stable@vger.kernel.org> # 3.16 - 4.x
Reported-by: Giacomo Comes <comes@naic.edu>
Signed-off-by: Jan Kara <jack@suse.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
/linux-master/fs/xfs/
H A Dxfs_message.cdiff bdec055b Tue Nov 06 08:50:50 MST 2018 Darrick J. Wong <darrick.wong@oracle.com> xfs: print buffer offsets when dumping corrupt buffers

Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
because modern Linux now prints a 32-bit hash of our 64-bit pointer when
using DUMP_PREFIX_ADDRESS:

00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00 ........;.......
00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00 ......Z.........
000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48 !.....L...<...|H

This is totally worthless for a sequential dump since we probably only
care about tracking the buffer offsets and afaik there's no way to
recover the actual pointer from the hashed value.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_extent_busy.cdiff 4bdfd7d1 Fri Dec 15 11:03:32 MST 2023 Darrick J. Wong <djwong@kernel.org> xfs: repair free space btrees

Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
diff 4f0f586b Thu Apr 08 12:28:34 MDT 2021 Sami Tolvanen <samitolvanen@google.com> treewide: Change list_sort to use const pointers

list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4ecbfe63 Sun Apr 29 04:41:10 MDT 2012 Dave Chinner <dchinner@redhat.com> xfs: clean up busy extent naming

Now that the busy extent tracking has been moved out of the
allocation files, clean up the namespace it uses to
"xfs_extent_busy" rather than a mix of "xfs_busy" and
"xfs_alloc_busy".

Signed-off-by: Dave Chinner<dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
H A Dxfs_export.cdiff 9e9a2674 Mon Feb 08 22:54:58 MST 2016 Dave Chinner <dchinner@redhat.com> xfs: move inode generation count to VFS inode

Pull another 4 bytes out of the xfs_icdinode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_quotaops.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_icreate_item.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_trace.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_qm_bhv.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_ioctl32.cdiff 4d1b97f9 Fri Jan 07 18:45:51 MST 2022 Darrick J. Wong <djwong@kernel.org> xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls

According to the glibc compat header for Irix 4, these ioctls originated
in April 1991 as a (somewhat clunky) way to preallocate space at the end
of a file on an EFS filesystem. XFS, which was released in Irix 5.3 in
December 1993, picked up these ioctls to maintain compatibility and they
were ported to Linux in the early 2000s.

Recently it was pointed out to me they still lurk in the kernel, even
though the Linux fallocate syscall supplanted the functionality a long
time ago. fstests doesn't seem to include any real functional or stress
tests for these ioctls, which means that the code quality is ... very
questionable. Most notably, it was a stale disk block exposure vector
for 21 years and nobody noticed or complained. As mature programmers
say, "If you're not testing it, it's broken."

Given all that, let's withdraw these ioctls from the XFS userspace API.
Normally we'd set a long deprecation process, but I estimate that there
aren't any real users, so let's trigger a warning in dmesg and return
-ENOTTY.

See: CVE-2021-4155

Augments: 983d8e60f508 ("xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4d1b97f9 Fri Jan 07 18:45:51 MST 2022 Darrick J. Wong <djwong@kernel.org> xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls

According to the glibc compat header for Irix 4, these ioctls originated
in April 1991 as a (somewhat clunky) way to preallocate space at the end
of a file on an EFS filesystem. XFS, which was released in Irix 5.3 in
December 1993, picked up these ioctls to maintain compatibility and they
were ported to Linux in the early 2000s.

Recently it was pointed out to me they still lurk in the kernel, even
though the Linux fallocate syscall supplanted the functionality a long
time ago. fstests doesn't seem to include any real functional or stress
tests for these ioctls, which means that the code quality is ... very
questionable. Most notably, it was a stale disk block exposure vector
for 21 years and nobody noticed or complained. As mature programmers
say, "If you're not testing it, it's broken."

Given all that, let's withdraw these ioctls from the XFS userspace API.
Normally we'd set a long deprecation process, but I estimate that there
aren't any real users, so let's trigger a warning in dmesg and return
-ENOTTY.

See: CVE-2021-4155

Augments: 983d8e60f508 ("xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4d542e4c Wed Feb 26 18:30:28 MST 2020 Christoph Hellwig <hch@lst.de> xfs: reject invalid flags combinations in XFS_IOC_ATTRLIST_BY_HANDLE

While the flags field in the ABI and the on-disk format allows for
multiple namespace flags, an attribute can only exist in a single
namespace at a time. Hence asking to list attributes that exist
in multiple namespaces simultaneously is a logically invalid
request and will return no results. Reject this case early with
-EINVAL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 4e247614 Sun Oct 11 23:02:56 MDT 2015 Jan Tulak <jtulak@redhat.com> xfs: prefix XATTR_LIST_MAX with XFS_

Remove a hard dependency of Linux XATTR_LIST_MAX value by using
a prefixed version. This patch reflects the same change in xfsprogs.

Signed-off-by: Jan Tulak <jtulak@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4d949021 Wed Jan 21 16:04:24 MST 2015 Brian Foster <bfoster@redhat.com> xfs: remove incorrect error negation in attr_multi ioctl

xfs_compat_attrmulti_by_handle() calls memdup_user() which returns a
negative error code. The error code is negated by the caller and thus
incorrectly converted to a positive error code.

Remove the error negation such that the negative error is passed
correctly back up to userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_icache.hdiff 10be350b Fri Jun 18 12:57:06 MDT 2021 Darrick J. Wong <djwong@kernel.org> xfs: fix type mismatches in the inode reclaim functions

It's currently unlikely that we will ever end up with more than 4
billion inodes waiting for reclamation, but the fs object code uses long
int for object counts and we're certainly capable of generating that
many. Instead of truncating the internal counters, widen them and
report the object counts correctly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff 4d0bab3a Wed Jul 01 11:21:28 MDT 2020 Dave Chinner <dchinner@redhat.com> xfs: remove SYNC_WAIT from xfs_reclaim_inodes()

Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.

For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.

For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.

Seeing as we now want to loop until all reclaimable inodes have been
reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
were skipped. This is much more reliable and will always loop until
all reclaimable inodes are reclaimed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_discard.cdiff 2bf9d264 Fri Apr 12 08:39:21 MDT 2019 Wang Shilong <wshilong@ddn.com> xfs,fstrim: fix to return correct minlen

This patch tries to address two problems:

1) return @minlen we used to trim to
user space.

2) return EINVAL if granularity is larger than
avg size, even most of cases, granularity is small(4K),
but if devices return a lager granularity for some reaons
(testing, bugs etc), fstrim should return failure directly.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4ecbfe63 Sun Apr 29 04:41:10 MDT 2012 Dave Chinner <dchinner@redhat.com> xfs: clean up busy extent naming

Now that the busy extent tracking has been moved out of the
allocation files, clean up the namespace it uses to
"xfs_extent_busy" rather than a mix of "xfs_busy" and
"xfs_alloc_busy".

Signed-off-by: Dave Chinner<dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
H A Dxfs_dquot_item.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
H A Dxfs_xattr.cdiff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff 4fb6e8ad Thu Nov 27 20:25:04 MST 2014 Christoph Hellwig <hch@lst.de> xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

Completed in 768 milliseconds

1234