Searched +hist:8 +hist:a79790b (Results 1 - 3 of 3) sorted by relevance

/linux-master/fs/afs/
server.c
diff eeba1e9c Sat Apr 13 01:37:37 MDT 2019 David Howells <dhowells@redhat.com> afs: Fix in-progress ops to ignore server-level callback invalidation

The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server. This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.

If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken. At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.

afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.

However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.


The problem comes when the server sends us a CB.InitCallBackState op. The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.

[*] There are other circumstances too, such as the server restarting or
needing to make space in its callback table.

Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.

So what happens is:

(1) A mount for a new volume is attempted, an inode is created for the root
vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
immediately, as we don't have a nominated server to talk to yet - and
we may iterate through a few to find one.

(2) Before the operation happens, afs_fetch_status(), say, notes in the
cursor (fc.cb_break) the break counter sum from the vnode, volume and
server counters, but the server->cb_s_break is currently 0.

(3) We send FS.FetchStatus to the server. The server sends us back
CB.InitCallBackState. We increment server->cb_s_break.

(4) Our FS.FetchStatus completes. The reply includes a callback record.

(5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
the callback promise was broken by checking the break counter sum from
step (2) against the current sum.

This fails because of step (3), so we don't set the callback record
and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.

This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.

(6) afs_validate() is triggered on the vnode, probably called from
d_revalidate() checking the parent directory.

(7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
update vnode->cb_s_break and assumes the vnode to be invalid.

(8) afs_validate() needs to call afs_fetch_status(). Go back to step (2)
and repeat, every time the vnode is validated.

This primarily affects volume root dir vnodes. Everything subsequent to
those inherits an already incremented cb_s_break upon mounting.
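
A minimal sketch of the two checks in steps (2) and (5), using stand-in types;
apart from cb_s_break, the counter and field names are assumptions:

    struct my_volume { unsigned int cb_v_break; };
    struct my_server { unsigned int cb_s_break; };
    struct my_vnode {
        unsigned int cb_break;
        struct my_volume *volume;
        struct my_server *server;
    };
    struct my_cursor { unsigned int cb_break; };

    /* Step (2): note the counter sum before issuing the RPC. */
    static void my_note_cb_break(struct my_cursor *fc, const struct my_vnode *v)
    {
        fc->cb_break = v->cb_break + v->volume->cb_v_break + v->server->cb_s_break;
    }

    /* Step (5): recheck after the reply.  The CB.InitCallBackState received in
     * the meantime bumped cb_s_break, so this returns false and the callback
     * record is never noted on the vnode. */
    static bool my_cb_promise_intact(const struct my_cursor *fc, const struct my_vnode *v)
    {
        return fc->cb_break == v->cb_break + v->volume->cb_v_break + v->server->cb_s_break;
    }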


The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].

[*] There is an extremely unlikely scenario where a reply from before the
CB.InitCallBackState could get its delivery deferred till after - at
which point we think we have a promise when we don't. This, however,
requires unlucky mass packet loss to one call.

AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.


Fix this by

(1) Remove AFS_SERVER_FL_NEW.

(2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
calculation.

(3) In afs_cb_is_broken(), don't include cb_s_break in the check.
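
For illustration, the two helpers named above might end up looking roughly
like this once the server-level counter is dropped from the sum (a sketch
only; the field names are assumptions based on the counters described above):

    static inline unsigned int afs_calc_vnode_cb_break(struct afs_vnode *vnode)
    {
        return vnode->cb_break + vnode->volume->cb_v_break;    /* cb_s_break no longer included */
    }

    static inline bool afs_cb_is_broken(unsigned int cb_break_seen,
                                        const struct afs_vnode *vnode)
    {
        return cb_break_seen != vnode->cb_break + vnode->volume->cb_v_break;
    }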


Signed-off-by: David Howells <dhowells@redhat.com>
diff d2ddc776 Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Overhaul volume and server record caching and fileserver rotation

The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other. Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers. The difference is purely in the database attached to the VL
servers.

The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.

Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).

To this end, the following structural changes are made:

(1) Server record management is overhauled:

(a) Server records are made independent of cell. The namespace keeps
track of them, volume records have lists of them and each vnode
has a server on which its callback interest currently resides.

(b) The cell record no longer keeps a list of servers known to be in
that cell.

(c) The server records are now kept in a flat list because there's no
single address to sort on.

(d) Server records are now keyed by their UUID within the namespace.

(e) The addresses for a server are obtained with the VL.GetAddrsU
rather than with VL.GetEntryByName, using the server's UUID as a
parameter.

(f) Cached server records are garbage collected after a period of
non-use and are counted out of existence before purging is allowed
to complete. This protects the work functions against rmmod.

(g) The servers list is now in /proc/fs/afs/servers.

(2) Volume record management is overhauled:

(a) An RCU-replaceable server list is introduced. This tracks both
servers and their corresponding callback interests.

(b) The superblock is now keyed on cell record and numeric volume ID.

(c) The volume record is now tied to the superblock which mounts it,
and is activated when mounted and deactivated when unmounted.
This makes it easier to handle the cache cookie without causing a
double-use in fscache.

(d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
to get the server UUID list.

(e) The volume name is updated if it is seen to have changed when the
volume is updated (the update is keyed on the volume ID).

(3) The vlocation record is got rid of and VLDB records are no longer
cached. Sufficient information is stored in the volume record, though
an update to a volume record is now no longer shared between related
volumes (volumes come in bundles of three: R/W, R/O and backup).

and the following procedural changes are made:

(1) The fileserver cursor introduced previously is now fleshed out and
used to iterate over fileservers and their addresses.

(2) Volume status is checked during iteration, and the server list is
replaced if a change is detected.

(3) Server status is checked during iteration, and the address list is
replaced if a change is detected.

(4) The abort code is saved into the address list cursor and -ECONNABORTED
returned in afs_make_call() if a remote abort happened rather than
translating the abort into an error message. This allows actions to
be taken depending on the abort code more easily.

(a) If a VMOVED abort is seen then this is handled by rechecking the
volume and restarting the iteration.

(b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
handled by sleeping for a short period and retrying and/or trying
other servers that might serve that volume. A message is also
displayed once until the condition has cleared.

(c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
moment.

(d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
see if it has been deleted; if not, the fileserver is probably
indicating that the volume couldn't be attached and needs
salvaging.

(e) If statfs() sees one of these aborts, it does not sleep, but
rather returns an error, so as not to block the umount program.

(5) The fileserver iteration functions in vnode.c are now merged into
their callers and more heavily macroised around the cursor. vnode.c
is removed.

(6) Operations on a particular vnode are serialised on that vnode because
the server will lock that vnode whilst it operates on it, so a second
op sent will just have to wait.

(7) Fileservers are probed with FS.GetCapabilities before being used.
This is where service upgrade will be done.

(8) A callback interest on a fileserver is set up before an FS operation
is performed and passed through to afs_make_call() so that it can be
set on the vnode if the operation returns a callback. The callback
interest is passed through to afs_iget() also so that it can be set
there too.

In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.
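
As a rough sketch of the abort-code dispatch described in procedural point (4)
above (the abort codes are real AFS constants; everything else is an
illustrative assumption):

    enum my_action { MY_RESTART_VL, MY_SLEEP_RETRY, MY_CHECK_DELETED, MY_FAIL };

    static enum my_action my_classify_abort(u32 abort_code)
    {
        switch (abort_code) {
        case VMOVED:            /* (a) recheck the volume, restart the iteration */
            return MY_RESTART_VL;
        case VBUSY:
        case VRESTARTING:
        case VSALVAGING:        /* (b) sleep briefly, then retry / try other servers */
            return MY_SLEEP_RETRY;
        case VOFFLINE:          /* (c) handled as VBUSY for the moment */
            return MY_SLEEP_RETRY;
        case VNOVOL:            /* (d) recheck the VLDB to see if the volume was deleted */
            return MY_CHECK_DELETED;
        default:                /* anything else is surfaced as -ECONNABORTED */
            return MY_FAIL;
        }
    }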

Notes:

(1) Pre AFS-3.4 servers are no longer supported, though this can be added
back if necessary (AFS-3.4 was released in 1998).

(2) VBUSY is retried forever for the moment at intervals of 1s.

(3) /proc/fs/afs/<cell>/servers no longer exists.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8b2a464c Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Add an address list concept

Add an RCU replaceable address list structure to hold a list of server
addresses. The list also holds the

To this end:

(1) A cell's VL server address list can be loaded directly via insmod or
echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
or SRV records.

(2) Anyone wanting to use a cell's VL server address must wait until the
cell record comes online and has tried to obtain some addresses.

(3) An FS server's address list, for the moment, has a single entry that
is the key to the server list. This will change in the future when a
server is instead keyed on its UUID and the VL.GetAddrsU operation is
used.

(4) An 'address cursor' concept is introduced to handle iteration through
the address list. This is passed to the afs_make_call() as, in the
future, stuff (such as abort code) that doesn't outlast the call will
be returned in it.

In the future, we might want to annotate the list with information about
how each address fares. We might then want to propagate such annotations
over address list replacement.

Whilst we're at it, we allow IPv6 addresses to be specified in
colon-delimited lists by enclosing them in square brackets.
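
A minimal sketch of the "address cursor" idea described in (4), with assumed
structure and field names (only the abort-code stashing comes from the text):

    struct my_addr_list {
        unsigned short nr_addrs;
        struct sockaddr_rxrpc addrs[8];
    };

    struct my_addr_cursor {
        const struct my_addr_list *alist;   /* RCU-protected list being walked */
        unsigned short index;               /* address currently being tried */
        u32 abort_code;                     /* filled in by the call, per (4) */
    };

    /* Step to the next address; returns false when the list is exhausted. */
    static bool my_addr_cursor_next(struct my_addr_cursor *ac)
    {
        if (ac->index + 1 >= ac->alist->nr_addrs)
            return false;
        ac->index++;
        return true;
    }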

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8a79790b Thu Mar 16 10:27:46 MDT 2017 Tina Ruchandani <ruchandani.tina@gmail.com> afs: Migrate vlocation fields to 64-bit

get_seconds() returns real wall-clock seconds. On 32-bit systems
this value will overflow in year 2038 and beyond. This patch changes
afs's vlocation record to use ktime_get_real_seconds() instead, for the
fields time_of_death and update_at.
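
A minimal before/after sketch of the change (the struct is heavily
abbreviated; only the two field names and the replacement helper come from
the text):

    struct my_vlocation {
        /* ... */
        time64_t time_of_death;     /* was filled with get_seconds() */
        time64_t update_at;         /* time at which the record should be updated */
    };

    static void my_mark_dead(struct my_vlocation *vl)
    {
        vl->time_of_death = ktime_get_real_seconds();  /* 64-bit, safe beyond 2038 */
    }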

Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
callback.c
diff eeba1e9c Sat Apr 13 01:37:37 MDT 2019 David Howells <dhowells@redhat.com> afs: Fix in-progress ops to ignore server-level callback invalidation

The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server. This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.

If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken. At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.

afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.

However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.


The problem comes when the server sends us a CB.InitCallBackState op. The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.

[*] There are other circumstances too, such as the server restarting or
needing to make space in its callback table.

Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.

So what happens is:

(1) A mount for a new volume is attempted, an inode is created for the root
vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
immediately, as we don't have a nominated server to talk to yet - and
we may iterate through a few to find one.

(2) Before the operation happens, afs_fetch_status(), say, notes in the
cursor (fc.cb_break) the break counter sum from the vnode, volume and
server counters, but the server->cb_s_break is currently 0.

(3) We send FS.FetchStatus to the server. The server sends us back
CB.InitCallBackState. We increment server->cb_s_break.

(4) Our FS.FetchStatus completes. The reply includes a callback record.

(5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
the callback promise was broken by checking the break counter sum from
step (2) against the current sum.

This fails because of step (3), so we don't set the callback record
and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.

This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.

(6) afs_validate() is triggered on the vnode, probably called from
d_revalidate() checking the parent directory.

(7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
update vnode->cb_s_break and assumes the vnode to be invalid.

(8) afs_validate() needs to call afs_fetch_status(). Go back to step (2)
and repeat, every time the vnode is validated.

This primarily affects volume root dir vnodes. Everything subsequent to
those inherits an already incremented cb_s_break upon mounting.


The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].

[*] There is an extremely unlikely scenario where a reply from before the
CB.InitCallBackState could get its delivery deferred till after - at
which point we think we have a promise when we don't. This, however,
requires unlucky mass packet loss to one call.

AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.


Fix this by

(1) Remove AFS_SERVER_FL_NEW.

(2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
calculation.

(3) In afs_cb_is_broken(), don't include cb_s_break in the check.


Signed-off-by: David Howells <dhowells@redhat.com>
diff d2ddc776 Thu Nov 02 09:27:50 MDT 2017 David Howells <dhowells@redhat.com> afs: Overhaul volume and server record caching and fileserver rotation

The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other. Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers. The difference is purely in the database attached to the VL
servers.

The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.

Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).

To this end, the following structural changes are made:

(1) Server record management is overhauled:

(a) Server records are made independent of cell. The namespace keeps
track of them, volume records have lists of them and each vnode
has a server on which its callback interest currently resides.

(b) The cell record no longer keeps a list of servers known to be in
that cell.

(c) The server records are now kept in a flat list because there's no
single address to sort on.

(d) Server records are now keyed by their UUID within the namespace.

(e) The addresses for a server are obtained with the VL.GetAddrsU
rather than with VL.GetEntryByName, using the server's UUID as a
parameter.

(f) Cached server records are garbage collected after a period of
non-use and are counted out of existence before purging is allowed
to complete. This protects the work functions against rmmod.

(g) The servers list is now in /proc/fs/afs/servers.

(2) Volume record management is overhauled:

(a) An RCU-replaceable server list is introduced. This tracks both
servers and their corresponding callback interests.

(b) The superblock is now keyed on cell record and numeric volume ID.

(c) The volume record is now tied to the superblock which mounts it,
and is activated when mounted and deactivated when unmounted.
This makes it easier to handle the cache cookie without causing a
double-use in fscache.

(d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
to get the server UUID list.

(e) The volume name is updated if it is seen to have changed when the
volume is updated (the update is keyed on the volume ID).

(3) The vlocation record is got rid of and VLDB records are no longer
cached. Sufficient information is stored in the volume record, though
an update to a volume record is now no longer shared between related
volumes (volumes come in bundles of three: R/W, R/O and backup).

and the following procedural changes are made:

(1) The fileserver cursor introduced previously is now fleshed out and
used to iterate over fileservers and their addresses.

(2) Volume status is checked during iteration, and the server list is
replaced if a change is detected.

(3) Server status is checked during iteration, and the address list is
replaced if a change is detected.

(4) The abort code is saved into the address list cursor and -ECONNABORTED
returned in afs_make_call() if a remote abort happened rather than
translating the abort into an error message. This allows actions to
be taken depending on the abort code more easily.

(a) If a VMOVED abort is seen then this is handled by rechecking the
volume and restarting the iteration.

(b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
handled by sleeping for a short period and retrying and/or trying
other servers that might serve that volume. A message is also
displayed once until the condition has cleared.

(c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
moment.

(d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
see if it has been deleted; if not, the fileserver is probably
indicating that the volume couldn't be attached and needs
salvaging.

(e) If statfs() sees one of these aborts, it does not sleep, but
rather returns an error, so as not to block the umount program.

(5) The fileserver iteration functions in vnode.c are now merged into
their callers and more heavily macroised around the cursor. vnode.c
is removed.

(6) Operations on a particular vnode are serialised on that vnode because
the server will lock that vnode whilst it operates on it, so a second
op sent will just have to wait.

(7) Fileservers are probed with FS.GetCapabilities before being used.
This is where service upgrade will be done.

(8) A callback interest on a fileserver is set up before an FS operation
is performed and passed through to afs_make_call() so that it can be
set on the vnode if the operation returns a callback. The callback
interest is passed through to afs_iget() also so that it can be set
there too.

In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.

Notes:

(1) Pre AFS-3.4 servers are no longer supported, though this can be added
back if necessary (AFS-3.4 was released in 1998).

(2) VBUSY is retried forever for the moment at intervals of 1s.

(3) /proc/fs/afs/<cell>/servers no longer exists.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 8a79790b Thu Mar 16 10:27:46 MDT 2017 Tina Ruchandani <ruchandani.tina@gmail.com> afs: Migrate vlocation fields to 64-bit

get_seconds() returns real wall-clock seconds. On 32-bit systems
this value will overflow in year 2038 and beyond. This patch changes
afs's vlocation record to use ktime_get_real_seconds() instead, for the
fields time_of_death and update_at.

Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
internal.h
diff 495f2ae9 Wed Oct 18 02:24:01 MDT 2023 David Howells <dhowells@redhat.com> afs: Fix fileserver rotation

Fix the fileserver rotation so that it doesn't use RTT as the basis for
deciding which server and address to use as this doesn't necessarily give a
good indication of the best path. Instead, use the configurable preference
list in conjunction with whatever probes have succeeded at the time of
looking.

To this end, make the following changes:

(1) Keep an array of "server states" to track what addresses we've tried
on each server and move the waitqueue entries there that we'll need
for probing.

(2) Each afs_server_state struct is made to pin the corresponding server's
endpoint state rather than the afs_operation struct carrying a pin on
the server we're currently looking at.

(3) Drop the server list preference; we now always rescan the server list.

(4) afs_wait_for_probes() now uses the server state list to guide it in
what it waits for (and to provide the waitqueue entries) and returns
an indication of whether we'd got a response, run out of responsive
addresses or the endpoint state had been superseded and we need to
restart the iteration.

(5) Call afs_get_address_preferences*() occasionally to refresh the
preference values.

(6) When picking a server, scan the addresses of the servers for which we
have as-yet untested communications, looking for the highest priority
one and use that instead of trying all the addresses for a particular
server in ascending-RTT order.

(7) When a Busy or Offline state is seen across all available servers, do
a short sleep.

(8) If we detect that we accessed a future RO volume version whilst it is
undergoing replication, reissue the op against the older version until
at least half of the servers are replicated.

(9) Whilst RO replication is ongoing, increase the frequency of Volume
Location server checks for that volume to every ten minutes instead of
hourly.

Also add a tracepoint to track progress through the rotation algorithm.
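
As an illustration of the selection rule in point (6) (pick the
highest-priority untried address rather than the lowest-RTT one), here is a
sketch with assumed structures:

    struct my_server_state {
        unsigned int nr_addrs;
        bool tried[16];             /* which addresses have already been attempted */
        unsigned int prio[16];      /* configured preference for each address */
    };

    /* Return the index of the highest-priority untried address, or -1 if none. */
    static int my_pick_addr(const struct my_server_state *ss)
    {
        unsigned int i, best_prio = 0;
        int best = -1;

        for (i = 0; i < ss->nr_addrs; i++) {
            if (ss->tried[i])
                continue;
            if (best < 0 || ss->prio[i] > best_prio) {
                best = i;
                best_prio = ss->prio[i];
            }
        }
        return best;
    }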

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff e6bace73 Thu Nov 02 10:26:59 MDT 2023 David Howells <dhowells@redhat.com> afs: Fix afs_server_list to be cleaned up with RCU

afs_server_list is accessed with the rcu_read_lock() held from
volume->servers, so it needs to be cleaned up correctly.

Fix this by using kfree_rcu() instead of kfree().
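
The pattern being applied is the usual one for structures reachable under
rcu_read_lock(); a sketch with a stand-in structure (the real afs_server_list
differs):

    struct my_server_list {
        struct rcu_head rcu;        /* needed so kfree_rcu() has somewhere to queue */
        refcount_t usage;
        /* ... the servers themselves ... */
    };

    static void my_put_serverlist(struct my_server_list *slist)
    {
        if (refcount_dec_and_test(&slist->usage))
            kfree_rcu(slist, rcu);  /* was kfree(slist), which could free under readers */
    }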

Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
diff bc899ee1 Tue Jun 29 15:37:05 MDT 2021 David Howells <dhowells@redhat.com> netfs: Add a netfs inode context

Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

    struct my_inode {
        struct {
            /* These must be contiguous */
            struct inode vfs_inode;
            struct netfs_i_context netfs_ctx;
        };
    };

The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:

    struct netfs_i_context {
        ...
        struct fscache_cookie *cache;
    };

Three functions are provided to help with this:

(1) void netfs_i_context_init(struct inode *inode,
                              const struct netfs_request_ops *ops);

Initialise the netfs context and set the operations.

(2) struct netfs_i_context *netfs_i_context(struct inode *inode);

Find the netfs context from the VFS inode.

(3) struct inode *netfs_inode(struct netfs_i_context *ctx);

Find the VFS inode from the netfs context.
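
For example, a filesystem might wire this up in its inode allocation path
roughly as follows (my_inode matches the wrapper quoted above; the cache and
ops variables are assumptions):

    static struct kmem_cache *my_inode_cachep;           /* assumed to exist */
    static const struct netfs_request_ops my_netfs_ops;  /* the fs's netfs ops (assumed) */

    static struct inode *my_alloc_inode(struct super_block *sb)
    {
        struct my_inode *mi = kmem_cache_alloc(my_inode_cachep, GFP_KERNEL);

        if (!mi)
            return NULL;
        netfs_i_context_init(&mi->vfs_inode, &my_netfs_ops);   /* helper (1) above */
        return &mi->vfs_inode;
    }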

Changes
=======
ver #4)
- Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
cache is present[3].
- Fix netfs_skip_folio_read() to zero out all of the page, not just some
of it[3].

ver #3)
- Split out the bit to move ceph cap-getting on readahead into
ceph_init_request()[1].
- Stick in a comment to the netfs inode structs indicating the contiguity
requirements[2].

ver #2)
- Adjust documentation to match.
- Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
- Move the cap check from ceph_readahead() to ceph_init_request() to be
called from netfslib.
- Remove ceph_readahead() and use netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com

Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4
diff 8fb72b4a Wed Feb 09 13:22:01 MST 2022 Matthew Wilcox (Oracle) <willy@infradead.org> fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()

Convert all users of fscache_set_page_dirty to use fscache_dirty_folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
diff 523d27cd Thu Feb 06 07:22:21 MST 2020 David Howells <dhowells@redhat.com> afs: Convert afs to use the new fscache API

Change the afs filesystem to support the new fscache driver.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.

(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.

For afs, I've made it render the volume name string as:

"afs,<cell>,<volume_id>"

and the coherency data is currently 0.

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.

(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.

fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).

fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.

(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.

(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.

(7) fscache_note_page_release() is called from afs_release_page().

(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.

Render the parts of the cookie key for an afs inode cookie as big endian.
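
A sketch of acquiring the volume cookie as described in (2), using the key
format quoted above; the afs field names and error handling here are
assumptions:

    static struct fscache_volume *my_acquire_volume_cookie(struct afs_volume *volume)
    {
        struct fscache_volume *vcookie;
        char *key;

        key = kasprintf(GFP_KERNEL, "afs,%s,%llu",
                        volume->cell->name, (unsigned long long)volume->vid);
        if (!key)
            return ERR_PTR(-ENOMEM);

        /* Three parameters, per the description above: key, cache name (or
         * NULL for any cache) and a u64 of coherency data (currently 0). */
        vcookie = fscache_acquire_volume(key, NULL, 0);
        kfree(key);
        return vcookie;
    }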

Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
diff 22650f14 Fri Apr 30 06:47:08 MDT 2021 David Howells <dhowells@redhat.com> afs: Fix speculative status fetches

The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:

kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)

This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).

What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).

On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).

On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.

This means that:

- a status fetch can access the target vnode either side of the exclusive
section of the modification

- the status fetch could start before the modification, yet finish after,
and vice-versa.

- the status fetch and the modification RPCs can complete in either order.

- the status fetch can return either the before or the after DV from the
modification.

- the status fetch might regress the locally cached DV.

Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.

Fix this by the following means:

(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.

(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.

Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.
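
The resulting race check might be sketched like this (AFS_VNODE_MODIFYING is
named in the text; the helper name and field access are assumptions):

    /* Return true if a speculative (bulk) status reply should be discarded. */
    static bool my_ignore_speculative_status(const struct afs_vnode *vnode,
                                             u64 expected_dv, u64 reply_dv)
    {
        return test_bit(AFS_VNODE_MODIFYING, &vnode->flags) ||
               reply_dv != expected_dv;
    }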

Fixes: a9e5c87ca744 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a9e5c87c Sun Nov 22 06:13:45 MST 2020 David Howells <dhowells@redhat.com> afs: Fix speculative status fetch going out of order wrt to modifications

When doing a lookup in a directory, the afs filesystem uses a bulk
status fetch to speculatively retrieve the statuses of up to 48 other
vnodes found in the same directory and it will then either update extant
inodes or create new ones - effectively doing 'lookup ahead'.

To avoid the possibility of deadlocking itself, however, the filesystem
doesn't lock all of those inodes; rather just the directory inode is
locked (by the VFS).

When the operation completes, afs_inode_init_from_status() or
afs_apply_status() is called, depending on whether the inode already
exists, to commit the new status.

A case exists, however, where the speculative status fetch operation may
straddle a modification operation on one of those vnodes. What can then
happen is that the speculative bulk status RPC retrieves the old status,
and whilst that is happening, the modification happens - which returns
an updated status, then the modification status is committed, then we
attempt to commit the speculative status.

This results in something like the following being seen in dmesg:

kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus

showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
say that the vnode had data version 8 when we'd already recorded version
9 due to a local modification. This was causing the cache to be
invalidated for that vnode when it shouldn't have been. If it happens
on a data file, this might lead to local changes being lost.

Fix this by ignoring speculative status updates if the data version
doesn't match the expected value.

Note that it is possible to get a DV regression if a volume gets
restored from a backup - but we should get a callback break in such a
case that should trigger a recheck anyway. It might be worth checking
the volume creation time in the volsync info and, if a change is
observed in that (as would happen on a restore), invalidate all caches
associated with the volume.

Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 8a070a96 Sat Apr 25 03:26:02 MDT 2020 David Howells <dhowells@redhat.com> afs: Detect cell aliases 1 - Cells with root volumes

Put in the first phase of cell alias detection. This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).

When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common. If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.

Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.
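
A sketch of the comparison this enables: with both lists sorted by UUID, one
merge pass finds any fileserver UUID in common (types simplified; the real
structures differ):

    struct my_uuid_list {
        unsigned int nr;
        uuid_t uuid[16];
    };

    static bool my_lists_share_uuid(const struct my_uuid_list *a,
                                    const struct my_uuid_list *b)
    {
        unsigned int i = 0, j = 0;

        while (i < a->nr && j < b->nr) {
            int cmp = memcmp(&a->uuid[i], &b->uuid[j], sizeof(uuid_t));

            if (cmp == 0)
                return true;    /* candidate alias: go on to compare addresses */
            if (cmp < 0)
                i++;
            else
                j++;
        }
        return false;
    }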

The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted, for future alias checking.

This is necessary because:

(1) Whilst fileservers have UUIDs that are meant to be globally unique, in
practice they are not because cells get cloned without changing the
UUIDs - so afs_server records need to be per cell.

(2) Sometimes the DNS is used to make cell aliases - but if we don't know
they're the same, we may end up with multiple superblocks and multiple
afs_server records for the same thing, impairing our ability to
deliver callback notifications of third-party changes.

(3) The fileserver RPC API doesn't contain the cell name, so it can't tell
us which cell it's notifying and can't see that a change made to
one cell should notify the same client that's also accessed as the
other cell.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
diff 8b6a666a Mon May 20 01:48:46 MDT 2019 David Howells <dhowells@redhat.com> afs: Provide an RCU-capable key lookup

Provide an RCU-capable key lookup function. We don't want to call
afs_request_key() in RCU-mode pathwalk because request_key() might sleep even
if we don't ask it to construct anything, since it might find a key that is
currently undergoing construction.
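
The shape of the split is the usual RCU-walk pattern; a sketch with
illustrative names (not the real afs API):

    static int my_revalidate(struct dentry *dentry, unsigned int flags)
    {
        struct key *key;

        if (flags & LOOKUP_RCU) {
            key = my_lookup_key_rcu(dentry);    /* must not sleep or construct */
            if (!key)
                return -ECHILD;                 /* drop out of RCU mode and retry */
        } else {
            key = my_request_key(dentry);       /* may construct and may sleep */
        }
        /* ... use the key to check permits ... */
        return 1;
    }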

Signed-off-by: David Howells <dhowells@redhat.com>

Completed in 258 milliseconds