Searched +hist:8 +hist:e8d7f13 (Results 1 - 5 of 5) sorted by relevance

/linux-master/fs/afs/
H A Dmain.cdiff 523d27cd Thu Feb 06 07:22:21 MST 2020 David Howells <dhowells@redhat.com> afs: Convert afs to use the new fscache API

Change the afs filesystem to support the new afs driver.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.

(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.

For afs, I've made it render the volume name string as:

"afs,<cell>,<volume_id>"

and the coherency data is currently 0.

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.

(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.

fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).

fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.

(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.

(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.

(7) fscache_note_page_release() is called from afs_release_page().

(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.

Render the parts of the cookie key for an afs inode cookie as big endian.

Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
diff 8a070a96 Sat Apr 25 03:26:02 MDT 2020 David Howells <dhowells@redhat.com> afs: Detect cell aliases 1 - Cells with root volumes

Put in the first phase of cell alias detection. This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).

When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common. If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.

Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.

The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.

This necessary because:

(1) Whilst fileservers have UUIDs that are meant to be globally unique, in
practice they are not because cells get cloned without changing the
UUIDs - so afs_server records need to be per cell.

(2) Sometimes the DNS is used to make cell aliases - but if we don't know
they're the same, we may end up with multiple superblocks and multiple
afs_server records for the same thing, impairing our ability to
deliver callback notifications of third party changes

(3) The fileserver RPC API doesn't contain the cell name, so it can't tell
us which cell it's notifying and can't see that a change made to to
one cell should notify the same client that's also accessed as the
other cell.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
diff d2ddc776 Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Overhaul volume and server record caching and fileserver rotation

The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other. Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers. The difference is purely in the database attached to the VL
servers.

The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.

Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).

To this end, the following structural changes are made:

(1) Server record management is overhauled:

(a) Server records are made independent of cell. The namespace keeps
track of them, volume records have lists of them and each vnode
has a server on which its callback interest currently resides.

(b) The cell record no longer keeps a list of servers known to be in
that cell.

(c) The server records are now kept in a flat list because there's no
single address to sort on.

(d) Server records are now keyed by their UUID within the namespace.

(e) The addresses for a server are obtained with the VL.GetAddrsU
rather than with VL.GetEntryByName, using the server's UUID as a
parameter.

(f) Cached server records are garbage collected after a period of
non-use and are counted out of existence before purging is allowed
to complete. This protects the work functions against rmmod.

(g) The servers list is now in /proc/fs/afs/servers.

(2) Volume record management is overhauled:

(a) An RCU-replaceable server list is introduced. This tracks both
servers and their coresponding callback interests.

(b) The superblock is now keyed on cell record and numeric volume ID.

(c) The volume record is now tied to the superblock which mounts it,
and is activated when mounted and deactivated when unmounted.
This makes it easier to handle the cache cookie without causing a
double-use in fscache.

(d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
to get the server UUID list.

(e) The volume name is updated if it is seen to have changed when the
volume is updated (the update is keyed on the volume ID).

(3) The vlocation record is got rid of and VLDB records are no longer
cached. Sufficient information is stored in the volume record, though
an update to a volume record is now no longer shared between related
volumes (volumes come in bundles of three: R/W, R/O and backup).

and the following procedural changes are made:

(1) The fileserver cursor introduced previously is now fleshed out and
used to iterate over fileservers and their addresses.

(2) Volume status is checked during iteration, and the server list is
replaced if a change is detected.

(3) Server status is checked during iteration, and the address list is
replaced if a change is detected.

(4) The abort code is saved into the address list cursor and -ECONNABORTED
returned in afs_make_call() if a remote abort happened rather than
translating the abort into an error message. This allows actions to
be taken depending on the abort code more easily.

(a) If a VMOVED abort is seen then this is handled by rechecking the
volume and restarting the iteration.

(b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
handled by sleeping for a short period and retrying and/or trying
other servers that might serve that volume. A message is also
displayed once until the condition has cleared.

(c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
moment.

(d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
see if it has been deleted; if not, the fileserver is probably
indicating that the volume couldn't be attached and needs
salvaging.

(e) If statfs() sees one of these aborts, it does not sleep, but
rather returns an error, so as not to block the umount program.

(5) The fileserver iteration functions in vnode.c are now merged into
their callers and more heavily macroised around the cursor. vnode.c
is removed.

(6) Operations on a particular vnode are serialised on that vnode because
the server will lock that vnode whilst it operates on it, so a second
op sent will just have to wait.

(7) Fileservers are probed with FS.GetCapabilities before being used.
This is where service upgrade will be done.

(8) A callback interest on a fileserver is set up before an FS operation
is performed and passed through to afs_make_call() so that it can be
set on the vnode if the operation returns a callback. The callback
interest is passed through to afs_iget() also so that it can be set
there too.

In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.

Notes:

(1) Pre AFS-3.4 servers are no longer supported, though this can be added
back if necessary (AFS-3.4 was released in 1998).

(2) VBUSY is retried forever for the moment at intervals of 1s.

(3) /proc/fs/afs/<cell>/servers no longer exists.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
H A Dcmservice.cdiff 6c881ca0 Tue Jun 15 04:57:26 MDT 2021 David Howells <dhowells@redhat.com> afs: Fix tracepoint string placement with built-in AFS

To quote Alexey[1]:

I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.

Then did

perf record -a -e ...

It crashed:

general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0

Then reproducer was narrowed to

# cat /sys/kernel/tracing/printk_formats

Original F34 kernel with modules didn't crash.

So I started to disable options and after disabling AFS everything
started working again.

The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.

Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.

Steps to reproduce:

CONFIG_AFS=y
CONFIG_TRACING=y

# cat /sys/kernel/tracing/printk_formats

Fix this by the following means:

(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.

(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.

(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?

Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.

(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.

Fixes: 8e8d7f13b6d5a9 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 372ee163 Wed Aug 03 07:11:40 MDT 2016 David Howells <dhowells@redhat.com> rxrpc: Fix races between skb free, ACK generation and replying

Inside the kafs filesystem it is possible to occasionally have a call
processed and terminated before we've had a chance to check whether we need
to clean up the rx queue for that call because afs_send_simple_reply() ends
the call when it is done, but this is done in a workqueue item that might
happen to run to completion before afs_deliver_to_call() completes.

Further, it is possible for rxrpc_kernel_send_data() to be called to send a
reply before the last request-phase data skb is released. The rxrpc skb
destructor is where the ACK processing is done and the call state is
advanced upon release of the last skb. ACK generation is also deferred to
a work item because it's possible that the skb destructor is not called in
a context where kernel_sendmsg() can be invoked.

To this end, the following changes are made:

(1) kernel_rxrpc_data_consumed() is added. This should be called whenever
an skb is emptied so as to crank the ACK and call states. This does
not release the skb, however. kernel_rxrpc_free_skb() must now be
called to achieve that. These together replace
rxrpc_kernel_data_delivered().

(2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().

This makes afs_deliver_to_call() easier to work as the skb can simply
be discarded unconditionally here without trying to work out what the
return value of the ->deliver() function means.

The ->deliver() functions can, via afs_data_complete(),
afs_transfer_reply() and afs_extract_data() mark that an skb has been
consumed (thereby cranking the state) without the need to
conditionally free the skb to make sure the state is correct on an
incoming call for when the call processor tries to send the reply.

(3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
has finished with a packet and MSG_PEEK isn't set.

(4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().

Because of this, we no longer need to clear the destructor and put the
call before we free the skb in cases where we don't want the ACK/call
state to be cranked.

(5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
the delivery function already), and the caller is now responsible for
producing an abort if that was the last packet.

(6) There are many bits of unmarshalling code where:

ret = afs_extract_data(call, skb, last, ...);
switch (ret) {
case 0: break;
case -EAGAIN: return 0;
default: return ret;
}

is to be found. As -EAGAIN can now be passed back to the caller, we
now just return if ret < 0:

ret = afs_extract_data(call, skb, last, ...);
if (ret < 0)
return ret;

(7) Checks for trailing data and empty final data packets has been
consolidated as afs_data_complete(). So:

if (skb->len > 0)
return -EBADMSG;
if (!last)
return 0;

becomes:

ret = afs_data_complete(call, skb, last);
if (ret < 0)
return ret;

(8) afs_transfer_reply() now checks the amount of data it has against the
amount of data desired and the amount of data in the skb and returns
an error to induce an abort if we don't get exactly what we want.

Without these changes, the following oops can occasionally be observed,
particularly if some printks are inserted into the delivery path:

general protection fault: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: kafsd afs_async_workfn [kafs]
task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
RIP: 0010:[<ffffffff8108fd3c>] [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
Stack:
0000000000000006 000000000be04930 0000000000000000 ffff880400000000
ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
Call Trace:
[<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
[<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
[<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
[<ffffffff810915f4>] lock_acquire+0x122/0x1b6
[<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff814c928f>] skb_dequeue+0x18/0x61
[<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
[<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
[<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
[<ffffffff81063a3a>] process_one_work+0x29d/0x57c
[<ffffffff81064ac2>] worker_thread+0x24a/0x385
[<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
[<ffffffff810696f5>] kthread+0xf3/0xfb
[<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
[<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
H A Drxrpc.cdiff 52bf9f6c Mon Dec 11 14:43:52 MST 2023 David Howells <dhowells@redhat.com> afs: Fix refcount underflow from error handling race

If an AFS cell that has an unreachable (eg. ENETUNREACH) server listed (VL
server or fileserver), an asynchronous probe to one of its addresses may
fail immediately because sendmsg() returns an error. When this happens, a
refcount underflow can happen if certain events hit a very small window.

The way this occurs is:

(1) There are two levels of "call" object, the afs_call and the
rxrpc_call. Each of them can be transitioned to a "completed" state
in the event of success or failure.

(2) Asynchronous afs_calls are self-referential whilst they are active to
prevent them from evaporating when they're not being processed. This
reference is disposed of when the afs_call is completed.

Note that an afs_call may only be completed once; once completed
completing it again will do nothing.

(3) When a call transmission is made, the app-side rxrpc code queues a Tx
buffer for the rxrpc I/O thread to transmit. The I/O thread invokes
sendmsg() to transmit it - and in the case of failure, it transitions
the rxrpc_call to the completed state.

(4) When an rxrpc_call is completed, the app layer is notified. In this
case, the app is kafs and it schedules a work item to process events
pertaining to an afs_call.

(5) When the afs_call event processor is run, it goes down through the
RPC-specific handler to afs_extract_data() to retrieve data from rxrpc
- and, in this case, it picks up the error from the rxrpc_call and
returns it.

The error is then propagated to the afs_call and that is completed
too. At this point the self-reference is released.

(6) If the rxrpc I/O thread manages to complete the rxrpc_call within the
window between rxrpc_send_data() queuing the request packet and
checking for call completion on the way out, then
rxrpc_kernel_send_data() will return the error from sendmsg() to the
app.

(7) Then afs_make_call() will see an error and will jump to the error
handling path which will attempt to clean up the afs_call.

(8) The problem comes when the error handling path in afs_make_call()
tries to unconditionally drop an async afs_call's self-reference.
This self-reference, however, may already have been dropped by
afs_extract_data() completing the afs_call

(9) The refcount underflows when we return to afs_do_probe_vlserver() and
that tries to drop its reference on the afs_call.

Fix this by making afs_make_call() attempt to complete the afs_call rather
than unconditionally putting it. That way, if afs_extract_data() manages
to complete the call first, afs_make_call() won't do anything.

The bug can be forced by making do_udp_sendmsg() return -ENETUNREACH and
sticking an msleep() in rxrpc_send_data() after the 'success:' label to
widen the race window.

The error message looks something like:

refcount_t: underflow; use-after-free.
WARNING: CPU: 3 PID: 720 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
...
RIP: 0010:refcount_warn_saturate+0xba/0x110
...
afs_put_call+0x1dc/0x1f0 [kafs]
afs_fs_get_capabilities+0x8b/0xe0 [kafs]
afs_fs_probe_fileserver+0x188/0x1e0 [kafs]
afs_lookup_server+0x3bf/0x3f0 [kafs]
afs_alloc_server_list+0x130/0x2e0 [kafs]
afs_create_volume+0x162/0x400 [kafs]
afs_get_tree+0x266/0x410 [kafs]
vfs_get_tree+0x25/0xc0
fc_mount+0xe/0x40
afs_d_automount+0x1b3/0x390 [kafs]
__traverse_mounts+0x8f/0x210
step_into+0x340/0x760
path_openat+0x13a/0x1260
do_filp_open+0xaf/0x160
do_sys_openat2+0xaf/0x170

or something like:

refcount_t: underflow; use-after-free.
...
RIP: 0010:refcount_warn_saturate+0x99/0xda
...
afs_put_call+0x4a/0x175
afs_send_vl_probes+0x108/0x172
afs_select_vlserver+0xd6/0x311
afs_do_cell_detect_alias+0x5e/0x1e9
afs_cell_detect_alias+0x44/0x92
afs_validate_fc+0x9d/0x134
afs_get_tree+0x20/0x2e6
vfs_get_tree+0x1d/0xc9
fc_mount+0xe/0x33
afs_d_automount+0x48/0x9d
__traverse_mounts+0xe0/0x166
step_into+0x140/0x274
open_last_lookups+0x1c1/0x1df
path_openat+0x138/0x1c3
do_filp_open+0x55/0xb4
do_sys_openat2+0x6c/0xb6

Fixes: 34fa47612bfe ("afs: Fix race in async call refcounting")
Reported-by: Bill MacAllister <bill@ca-zephyr.org>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1052304
Suggested-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/2633992.1702073229@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2c099014 Thu Nov 02 09:27:51 MDT 2017 David Howells <dhowells@redhat.com> afs: Trace the sending of pages

Add a pair of tracepoints to log the sending of pages for an FS.StoreData
or FS.StoreData64 operation.

Tracepoint afs_send_pages notes each set of pages added to the operation.
There may be several of these per operation as we get up at most 8
contiguous pages in one go because the bvec we're using is on the stack.

Tracepoint afs_sent_pages notes the end of adding data from a whole run of
pages to the operation and the completion of the request phase.

Signed-off-by: David Howells <dhowells@redhat.com>
diff d2ddc776 Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Overhaul volume and server record caching and fileserver rotation

The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other. Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers. The difference is purely in the database attached to the VL
servers.

The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.

Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).

To this end, the following structural changes are made:

(1) Server record management is overhauled:

(a) Server records are made independent of cell. The namespace keeps
track of them, volume records have lists of them and each vnode
has a server on which its callback interest currently resides.

(b) The cell record no longer keeps a list of servers known to be in
that cell.

(c) The server records are now kept in a flat list because there's no
single address to sort on.

(d) Server records are now keyed by their UUID within the namespace.

(e) The addresses for a server are obtained with the VL.GetAddrsU
rather than with VL.GetEntryByName, using the server's UUID as a
parameter.

(f) Cached server records are garbage collected after a period of
non-use and are counted out of existence before purging is allowed
to complete. This protects the work functions against rmmod.

(g) The servers list is now in /proc/fs/afs/servers.

(2) Volume record management is overhauled:

(a) An RCU-replaceable server list is introduced. This tracks both
servers and their coresponding callback interests.

(b) The superblock is now keyed on cell record and numeric volume ID.

(c) The volume record is now tied to the superblock which mounts it,
and is activated when mounted and deactivated when unmounted.
This makes it easier to handle the cache cookie without causing a
double-use in fscache.

(d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
to get the server UUID list.

(e) The volume name is updated if it is seen to have changed when the
volume is updated (the update is keyed on the volume ID).

(3) The vlocation record is got rid of and VLDB records are no longer
cached. Sufficient information is stored in the volume record, though
an update to a volume record is now no longer shared between related
volumes (volumes come in bundles of three: R/W, R/O and backup).

and the following procedural changes are made:

(1) The fileserver cursor introduced previously is now fleshed out and
used to iterate over fileservers and their addresses.

(2) Volume status is checked during iteration, and the server list is
replaced if a change is detected.

(3) Server status is checked during iteration, and the address list is
replaced if a change is detected.

(4) The abort code is saved into the address list cursor and -ECONNABORTED
returned in afs_make_call() if a remote abort happened rather than
translating the abort into an error message. This allows actions to
be taken depending on the abort code more easily.

(a) If a VMOVED abort is seen then this is handled by rechecking the
volume and restarting the iteration.

(b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
handled by sleeping for a short period and retrying and/or trying
other servers that might serve that volume. A message is also
displayed once until the condition has cleared.

(c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
moment.

(d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
see if it has been deleted; if not, the fileserver is probably
indicating that the volume couldn't be attached and needs
salvaging.

(e) If statfs() sees one of these aborts, it does not sleep, but
rather returns an error, so as not to block the umount program.

(5) The fileserver iteration functions in vnode.c are now merged into
their callers and more heavily macroised around the cursor. vnode.c
is removed.

(6) Operations on a particular vnode are serialised on that vnode because
the server will lock that vnode whilst it operates on it, so a second
op sent will just have to wait.

(7) Fileservers are probed with FS.GetCapabilities before being used.
This is where service upgrade will be done.

(8) A callback interest on a fileserver is set up before an FS operation
is performed and passed through to afs_make_call() so that it can be
set on the vnode if the operation returns a callback. The callback
interest is passed through to afs_iget() also so that it can be set
there too.

In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.

Notes:

(1) Pre AFS-3.4 servers are no longer supported, though this can be added
back if necessary (AFS-3.4 was released in 1998).

(2) VBUSY is retried forever for the moment at intervals of 1s.

(3) /proc/fs/afs/<cell>/servers no longer exists.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8b2a464c Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Add an address list concept

Add an RCU replaceable address list structure to hold a list of server
addresses. The list also holds the

To this end:

(1) A cell's VL server address list can be loaded directly via insmod or
echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
or SRV records.

(2) Anyone wanting to use a cell's VL server address must wait until the
cell record comes online and has tried to obtain some addresses.

(3) An FS server's address list, for the moment, has a single entry that
is the key to the server list. This will change in the future when a
server is instead keyed on its UUID and the VL.GetAddrsU operation is
used.

(4) An 'address cursor' concept is introduced to handle iteration through
the address list. This is passed to the afs_make_call() as, in the
future, stuff (such as abort code) that doesn't outlast the call will
be returned in it.

In the future, we might want to annotate the list with information about
how each address fares. We might then want to propagate such annotations
over address list replacement.

Whilst we're at it, we allow IPv6 addresses to be specified in
colon-delimited lists by enclosing them in square brackets.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 372ee163 Wed Aug 03 07:11:40 MDT 2016 David Howells <dhowells@redhat.com> rxrpc: Fix races between skb free, ACK generation and replying

Inside the kafs filesystem it is possible to occasionally have a call
processed and terminated before we've had a chance to check whether we need
to clean up the rx queue for that call because afs_send_simple_reply() ends
the call when it is done, but this is done in a workqueue item that might
happen to run to completion before afs_deliver_to_call() completes.

Further, it is possible for rxrpc_kernel_send_data() to be called to send a
reply before the last request-phase data skb is released. The rxrpc skb
destructor is where the ACK processing is done and the call state is
advanced upon release of the last skb. ACK generation is also deferred to
a work item because it's possible that the skb destructor is not called in
a context where kernel_sendmsg() can be invoked.

To this end, the following changes are made:

(1) kernel_rxrpc_data_consumed() is added. This should be called whenever
an skb is emptied so as to crank the ACK and call states. This does
not release the skb, however. kernel_rxrpc_free_skb() must now be
called to achieve that. These together replace
rxrpc_kernel_data_delivered().

(2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().

This makes afs_deliver_to_call() easier to work as the skb can simply
be discarded unconditionally here without trying to work out what the
return value of the ->deliver() function means.

The ->deliver() functions can, via afs_data_complete(),
afs_transfer_reply() and afs_extract_data() mark that an skb has been
consumed (thereby cranking the state) without the need to
conditionally free the skb to make sure the state is correct on an
incoming call for when the call processor tries to send the reply.

(3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
has finished with a packet and MSG_PEEK isn't set.

(4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().

Because of this, we no longer need to clear the destructor and put the
call before we free the skb in cases where we don't want the ACK/call
state to be cranked.

(5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
the delivery function already), and the caller is now responsible for
producing an abort if that was the last packet.

(6) There are many bits of unmarshalling code where:

ret = afs_extract_data(call, skb, last, ...);
switch (ret) {
case 0: break;
case -EAGAIN: return 0;
default: return ret;
}

is to be found. As -EAGAIN can now be passed back to the caller, we
now just return if ret < 0:

ret = afs_extract_data(call, skb, last, ...);
if (ret < 0)
return ret;

(7) Checks for trailing data and empty final data packets has been
consolidated as afs_data_complete(). So:

if (skb->len > 0)
return -EBADMSG;
if (!last)
return 0;

becomes:

ret = afs_data_complete(call, skb, last);
if (ret < 0)
return ret;

(8) afs_transfer_reply() now checks the amount of data it has against the
amount of data desired and the amount of data in the skb and returns
an error to induce an abort if we don't get exactly what we want.

Without these changes, the following oops can occasionally be observed,
particularly if some printks are inserted into the delivery path:

general protection fault: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: kafsd afs_async_workfn [kafs]
task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
RIP: 0010:[<ffffffff8108fd3c>] [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
Stack:
0000000000000006 000000000be04930 0000000000000000 ffff880400000000
ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
Call Trace:
[<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
[<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
[<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
[<ffffffff810915f4>] lock_acquire+0x122/0x1b6
[<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff814c928f>] skb_dequeue+0x18/0x61
[<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
[<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
[<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
[<ffffffff81063a3a>] process_one_work+0x29d/0x57c
[<ffffffff81064ac2>] worker_thread+0x24a/0x385
[<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
[<ffffffff810696f5>] kthread+0xf3/0xfb
[<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
[<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
H A Dinternal.hdiff 495f2ae9 Wed Oct 18 02:24:01 MDT 2023 David Howells <dhowells@redhat.com> afs: Fix fileserver rotation

Fix the fileserver rotation so that it doesn't use RTT as the basis for
deciding which server and address to use as this doesn't necessarily give a
good indication of the best path. Instead, use the configurable preference
list in conjunction with whatever probes have succeeded at the time of
looking.

To this end, make the following changes:

(1) Keep an array of "server states" to track what addresses we've tried
on each server and move the waitqueue entries there that we'll need
for probing.

(2) Each afs_server_state struct is made to pin the corresponding server's
endpoint state rather than the afs_operation struct carrying a pin on
the server we're currently looking at.

(3) Drop the server list preference; we now always rescan the server list.

(4) afs_wait_for_probes() now uses the server state list to guide it in
what it waits for (and to provide the waitqueue entries) and returns
an indication of whether we'd got a response, run out of responsive
addresses or the endpoint state had been superseded and we need to
restart the iteration.

(5) Call afs_get_address_preferences*() occasionally to refresh the
preference values.

(6) When picking a server, scan the addresses of the servers for which we
have as-yet untested communications, looking for the highest priority
one and use that instead of trying all the addresses for a particular
server in ascending-RTT order.

(7) When a Busy or Offline state is seen across all available servers, do
a short sleep.

(8) If we detect that we accessed a future RO volume version whilst it is
undergoing replication, reissue the op against the older version until
at least half of the servers are replicated.

(9) Whilst RO replication is ongoing, increase the frequency of Volume
Location server checks for that volume to every ten minutes instead of
hourly.

Also add a tracepoint to track progress through the rotation algorithm.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff e6bace73 Thu Nov 02 10:26:59 MDT 2023 David Howells <dhowells@redhat.com> afs: Fix afs_server_list to be cleaned up with RCU

afs_server_list is accessed with the rcu_read_lock() held from
volume->servers, so it needs to be cleaned up correctly.

Fix this by using kfree_rcu() instead of kfree().

Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff bc899ee1 Tue Jun 29 15:37:05 MDT 2021 David Howells <dhowells@redhat.com> netfs: Add a netfs inode context

Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

struct my_inode {
struct {
/* These must be contiguous */
struct inode vfs_inode;
struct netfs_i_context netfs_ctx;
};
};

The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:

struct netfs_i_context {
...
struct fscache_cookie *cache;
};

Three functions are provided to help with this:

(1) void netfs_i_context_init(struct inode *inode,
const struct netfs_request_ops *ops);

Initialise the netfs context and set the operations.

(2) struct netfs_i_context *netfs_i_context(struct inode *inode);

Find the netfs context from the VFS inode.

(3) struct inode *netfs_inode(struct netfs_i_context *ctx);

Find the VFS inode from the netfs context.

Changes
=======
ver #4)
- Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
cache is present[3].
- Fix netfs_skip_folio_read() to zero out all of the page, not just some
of it[3].

ver #3)
- Split out the bit to move ceph cap-getting on readahead into
ceph_init_request()[1].
- Stick in a comment to the netfs inode structs indicating the contiguity
requirements[2].

ver #2)
- Adjust documentation to match.
- Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
- Move the cap check from ceph_readahead() to ceph_init_request() to be
called from netfslib.
- Remove ceph_readahead() and use netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com

Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4
diff 8fb72b4a Wed Feb 09 13:22:01 MST 2022 Matthew Wilcox (Oracle) <willy@infradead.org> fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()

Convert all users of fscache_set_page_dirty to use fscache_dirty_folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
diff 523d27cd Thu Feb 06 07:22:21 MST 2020 David Howells <dhowells@redhat.com> afs: Convert afs to use the new fscache API

Change the afs filesystem to support the new afs driver.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.

(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.

For afs, I've made it render the volume name string as:

"afs,<cell>,<volume_id>"

and the coherency data is currently 0.

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.

(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.

fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).

fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.

(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.

(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.

(7) fscache_note_page_release() is called from afs_release_page().

(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.

Render the parts of the cookie key for an afs inode cookie as big endian.

Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
diff 22650f14 Fri Apr 30 06:47:08 MDT 2021 David Howells <dhowells@redhat.com> afs: Fix speculative status fetches

The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:

kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)

This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).

What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).

On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).

On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.

This means that:

- a status fetch can access the target vnode either side of the exclusive
section of the modification

- the status fetch could start before the modification, yet finish after,
and vice-versa.

- the status fetch and the modification RPCs can complete in either order.

- the status fetch can return either the before or the after DV from the
modification.

- the status fetch might regress the locally cached DV.

Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.

Fix this by the following means:

(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.

(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.

Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.

Fixes: a9e5c87ca744 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a9e5c87c Sun Nov 22 06:13:45 MST 2020 David Howells <dhowells@redhat.com> afs: Fix speculative status fetch going out of order wrt to modifications

When doing a lookup in a directory, the afs filesystem uses a bulk
status fetch to speculatively retrieve the statuses of up to 48 other
vnodes found in the same directory and it will then either update extant
inodes or create new ones - effectively doing 'lookup ahead'.

To avoid the possibility of deadlocking itself, however, the filesystem
doesn't lock all of those inodes; rather just the directory inode is
locked (by the VFS).

When the operation completes, afs_inode_init_from_status() or
afs_apply_status() is called, depending on whether the inode already
exists, to commit the new status.

A case exists, however, where the speculative status fetch operation may
straddle a modification operation on one of those vnodes. What can then
happen is that the speculative bulk status RPC retrieves the old status,
and whilst that is happening, the modification happens - which returns
an updated status, then the modification status is committed, then we
attempt to commit the speculative status.

This results in something like the following being seen in dmesg:

kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus

showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
say that the vnode had data version 8 when we'd already recorded version
9 due to a local modification. This was causing the cache to be
invalidated for that vnode when it shouldn't have been. If it happens
on a data file, this might lead to local changes being lost.

Fix this by ignoring speculative status updates if the data version
doesn't match the expected value.

Note that it is possible to get a DV regression if a volume gets
restored from a backup - but we should get a callback break in such a
case that should trigger a recheck anyway. It might be worth checking
the volume creation time in the volsync info and, if a change is
observed in that (as would happen on a restore), invalidate all caches
associated with the volume.

Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a9e5c87c Sun Nov 22 06:13:45 MST 2020 David Howells <dhowells@redhat.com> afs: Fix speculative status fetch going out of order wrt to modifications

When doing a lookup in a directory, the afs filesystem uses a bulk
status fetch to speculatively retrieve the statuses of up to 48 other
vnodes found in the same directory and it will then either update extant
inodes or create new ones - effectively doing 'lookup ahead'.

To avoid the possibility of deadlocking itself, however, the filesystem
doesn't lock all of those inodes; rather just the directory inode is
locked (by the VFS).

When the operation completes, afs_inode_init_from_status() or
afs_apply_status() is called, depending on whether the inode already
exists, to commit the new status.

A case exists, however, where the speculative status fetch operation may
straddle a modification operation on one of those vnodes. What can then
happen is that the speculative bulk status RPC retrieves the old status,
and whilst that is happening, the modification happens - which returns
an updated status, then the modification status is committed, then we
attempt to commit the speculative status.

This results in something like the following being seen in dmesg:

kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus

showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
say that the vnode had data version 8 when we'd already recorded version
9 due to a local modification. This was causing the cache to be
invalidated for that vnode when it shouldn't have been. If it happens
on a data file, this might lead to local changes being lost.

Fix this by ignoring speculative status updates if the data version
doesn't match the expected value.

Note that it is possible to get a DV regression if a volume gets
restored from a backup - but we should get a callback break in such a
case that should trigger a recheck anyway. It might be worth checking
the volume creation time in the volsync info and, if a change is
observed in that (as would happen on a restore), invalidate all caches
associated with the volume.

Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 8a070a96 Sat Apr 25 03:26:02 MDT 2020 David Howells <dhowells@redhat.com> afs: Detect cell aliases 1 - Cells with root volumes

Put in the first phase of cell alias detection. This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).

When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common. If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.

Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.

The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.

This necessary because:

(1) Whilst fileservers have UUIDs that are meant to be globally unique, in
practice they are not because cells get cloned without changing the
UUIDs - so afs_server records need to be per cell.

(2) Sometimes the DNS is used to make cell aliases - but if we don't know
they're the same, we may end up with multiple superblocks and multiple
afs_server records for the same thing, impairing our ability to
deliver callback notifications of third party changes

(3) The fileserver RPC API doesn't contain the cell name, so it can't tell
us which cell it's notifying and can't see that a change made to to
one cell should notify the same client that's also accessed as the
other cell.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
diff 8b6a666a Mon May 20 01:48:46 MDT 2019 David Howells <dhowells@redhat.com> afs: Provide an RCU-capable key lookup

Provide an RCU-capable key lookup function. We don't want to call
afs_request_key() in RCU-mode pathwalk as request_key() might sleep, even if
we don't ask it to construct anything as it might find a key that is currently
undergoing construction.

Signed-off-by: David Howells <dhowells@redhat.com>
/linux-master/include/trace/events/
H A Dafs.hdiff 495f2ae9 Wed Oct 18 02:24:01 MDT 2023 David Howells <dhowells@redhat.com> afs: Fix fileserver rotation

Fix the fileserver rotation so that it doesn't use RTT as the basis for
deciding which server and address to use as this doesn't necessarily give a
good indication of the best path. Instead, use the configurable preference
list in conjunction with whatever probes have succeeded at the time of
looking.

To this end, make the following changes:

(1) Keep an array of "server states" to track what addresses we've tried
on each server and move the waitqueue entries there that we'll need
for probing.

(2) Each afs_server_state struct is made to pin the corresponding server's
endpoint state rather than the afs_operation struct carrying a pin on
the server we're currently looking at.

(3) Drop the server list preference; we now always rescan the server list.

(4) afs_wait_for_probes() now uses the server state list to guide it in
what it waits for (and to provide the waitqueue entries) and returns
an indication of whether we'd got a response, run out of responsive
addresses or the endpoint state had been superseded and we need to
restart the iteration.

(5) Call afs_get_address_preferences*() occasionally to refresh the
preference values.

(6) When picking a server, scan the addresses of the servers for which we
have as-yet untested communications, looking for the highest priority
one and use that instead of trying all the addresses for a particular
server in ascending-RTT order.

(7) When a Busy or Offline state is seen across all available servers, do
a short sleep.

(8) If we detect that we accessed a future RO volume version whilst it is
undergoing replication, reissue the op against the older version until
at least half of the servers are replicated.

(9) Whilst RO replication is ongoing, increase the frequency of Volume
Location server checks for that volume to every ten minutes instead of
hourly.

Also add a tracepoint to track progress through the rotation algorithm.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff 6c881ca0 Tue Jun 15 04:57:26 MDT 2021 David Howells <dhowells@redhat.com> afs: Fix tracepoint string placement with built-in AFS

To quote Alexey[1]:

I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.

Then did

perf record -a -e ...

It crashed:

general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0

Then reproducer was narrowed to

# cat /sys/kernel/tracing/printk_formats

Original F34 kernel with modules didn't crash.

So I started to disable options and after disabling AFS everything
started working again.

The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.

Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.

Steps to reproduce:

CONFIG_AFS=y
CONFIG_TRACING=y

# cat /sys/kernel/tracing/printk_formats

Fix this by the following means:

(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.

(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.

(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?

Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.

(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.

Fixes: 8e8d7f13b6d5a9 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
diff 2c099014 Thu Nov 02 09:27:51 MDT 2017 David Howells <dhowells@redhat.com> afs: Trace the sending of pages

Add a pair of tracepoints to log the sending of pages for an FS.StoreData
or FS.StoreData64 operation.

Tracepoint afs_send_pages notes each set of pages added to the operation.
There may be several of these per operation as we get up at most 8
contiguous pages in one go because the bvec we're using is on the stack.

Tracepoint afs_sent_pages notes the end of adding data from a whole run of
pages to the operation and the completion of the request phase.

Signed-off-by: David Howells <dhowells@redhat.com>
8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>
8e8d7f13 Thu Jan 05 03:38:34 MST 2017 David Howells <dhowells@redhat.com> afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>

Completed in 439 milliseconds