History log of /linux-master/fs/dlm/lowcomms.c
Revision Date Author Comments
# e9cdebbe 06-Nov-2023 Jordan Rife <jrife@google.com>

dlm: use kernel_connect() and kernel_bind()

Recent changes to kernel_connect() and kernel_bind() ensure that
callers are insulated from changes to the address parameter made by BPF
SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and
ops->bind() with kernel_connect() and kernel_bind() to protect callers
in such cases.

Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind")
Cc: stable@vger.kernel.org
Signed-off-by: Jordan Rife <jrife@google.com>
Signed-off-by: David Teigland <teigland@redhat.com>
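
The idea behind this change can be sketched in userland C. This is only
an illustration under stated assumptions: struct demo_addr,
demo_ops_connect() and demo_kernel_connect() are hypothetical stand-ins,
not kernel APIs. It shows why a wrapper that passes a private copy of
the address keeps the caller's address intact when a hook rewrites it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a socket address; not a kernel type. */
struct demo_addr {
	char host[32];
	unsigned short port;
};

/* Stand-in for ops->connect() with a hook that rewrites the address
 * in place, as a BPF SOCK_ADDR hook may do. */
static int demo_ops_connect(struct demo_addr *addr)
{
	strcpy(addr->host, "rewritten-by-hook");
	addr->port = 9999;
	return 0;
}

/* Wrapper in the spirit of kernel_connect(): the callee only ever
 * sees a private copy, so the caller's address stays untouched. */
static int demo_kernel_connect(const struct demo_addr *addr)
{
	struct demo_addr copy = *addr;

	return demo_ops_connect(&copy);
}
```

Calling demo_kernel_connect() leaves the caller's struct unchanged, while
calling demo_ops_connect() directly does not.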


# a470cb2a 10-Oct-2023 Alexander Aring <aahringo@redhat.com>

dlm: slow down filling up processing queue

If there is a burst of messages, the receive worker fills up the
processing queue, but we are too slow to process the dlm messages. This
patch slows down the receive worker so that the buffer stays on the
socket layer, which tells the sender to back off. This is done with a
threshold: once it is reached, the next buffers are fetched from the
socket only after all queued messages have been processed, which is
ensured by a flush_workqueue(). This only occurs during a message burst,
e.g. when we create 1 million locks. If we kept putting more and more
new messages into the processqueue, we would soon run out of memory.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
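
The throttling described above can be sketched as a small userland
model. Everything here is an assumption for illustration: the names, the
threshold value, and the fact that demo_flush() processes synchronously,
whereas the real processing happens in a separate workqueue.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_THRESHOLD 4	/* assumed value, for illustration only */

struct demo_rx {
	size_t on_socket;	/* buffers still waiting in the socket */
	size_t queued;		/* buffers in the processing queue */
	size_t processed;	/* buffers fully processed */
	size_t max_queued;	/* high-water mark of the processing queue */
};

/* Stand-in for flush_workqueue(): process everything that is queued. */
static void demo_flush(struct demo_rx *rx)
{
	rx->processed += rx->queued;
	rx->queued = 0;
}

/* Receive worker: move buffers from the socket to the processing queue,
 * but drain the queue whenever the threshold is reached. */
static void demo_receive_worker(struct demo_rx *rx)
{
	while (rx->on_socket > 0) {
		rx->on_socket--;
		rx->queued++;
		if (rx->queued > rx->max_queued)
			rx->max_queued = rx->queued;
		if (rx->queued >= DEMO_THRESHOLD)
			demo_flush(rx);
	}
	demo_flush(rx);
}
```

With this shape the processing queue never grows beyond the threshold,
no matter how many buffers are waiting on the socket.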


# 4b056db8 01-Aug-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove unused processed_nodes

The variable processed_nodes, left over from commit 1696c75f1864
("fs: dlm: add send ack threshold and append acks to msgs"), is not
used. This patch removes it.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a1a5e875 23-Jun-2023 David Howells <dhowells@redhat.com>

dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage. This allows ->sendpage() to be replaced by something
that can handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christine Caulfield <ccaulfie@redhat.com>
cc: David Teigland <teigland@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: cluster-devel@redhat.com
Link: https://lore.kernel.org/r/20230623225513.2732256-7-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 1696c75f 29-May-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: add send ack threshold and append acks to msgs

This patch changes when we send an ack back to tell the other side that
it can free some messages because they have arrived on the receiver
node. Due to random reconnects, e.g. TCP resets, this is also handled on
the application layer so that DLM does not run into a deadlock state.

The current handling has the following problems:

1. We end up in situations where we send out only an ack message of 16
bytes and no other messages, whereas DLM has logic to combine as many
messages as it can into one send() socket call. This behaviour can be
observed with "trace-cmd start -e dlm_recv" and watching for the ret
field being 16 bytes.

2. When the processing of DLM messages never ends because we receive a
lot of messages, we never send an ack back, because that happens only
when the processing loop ends.

This patch introduces a likely and an unlikely threshold case. The
likely case sends an ack back on the transmit path once the amount of
processed upper layer protocol messages reaches the threshold. This
solves issue 1 because the ack is sent together with another normal DLM
message. It solves issue 2 because it is not part of the processing
loop.

There is however an unlikely case, which has a bigger threshold and is
triggered when we only receive messages and do not send any message
back. This case prevents the sending node from keeping a lot of messages
for a long time, because we still occasionally send acks back to tell
the sender to finally release messages.

The atomic cmpxchg() is there to send the ack atomically with the reset
of the upper layer protocol delivery counter.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
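
The two thresholds and the compare-and-exchange reset can be sketched
with C11 atomics. This is a userland sketch, not the dlm code: the
threshold values and all demo_* names are assumptions, and
atomic_compare_exchange_weak() stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_SEND_ACK_THRESHOLD 32	/* likely case, assumed value */
#define DEMO_RECV_ACK_THRESHOLD 64	/* unlikely case, assumed value */

static atomic_uint demo_delivered;	/* messages delivered since last ack */

/* Reset the counter and report whether this caller won the right to
 * send the ack. Only one caller can succeed per counter value. */
static int demo_try_ack(unsigned int threshold)
{
	unsigned int seen = atomic_load(&demo_delivered);

	while (seen >= threshold) {
		/* on failure, "seen" is reloaded with the current value */
		if (atomic_compare_exchange_weak(&demo_delivered, &seen, 0))
			return 1;
	}
	return 0;
}

/* Receive path: count the delivery, ack only past the larger threshold. */
static int demo_on_deliver(void)
{
	atomic_fetch_add(&demo_delivered, 1);
	return demo_try_ack(DEMO_RECV_ACK_THRESHOLD);
}

/* Transmit path: piggyback an ack once the smaller threshold is reached. */
static int demo_on_send(void)
{
	return demo_try_ack(DEMO_SEND_ACK_THRESHOLD);
}
```

Because the reset and the decision to ack happen in one atomic step, two
paths racing on the same counter value cannot both send an ack for it.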


# 07ee3867 29-May-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: filter ourself midcomms calls

It makes no sense to call midcomms/lowcomms functionality for the local
node as socket functionality is only required for remote nodes. This
patch filters those calls in the upper layer of lockspace membership
handling instead of doing it in midcomms/lowcomms layer as they should
never be aware of local nodeid.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# d41a1a3d 29-May-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: cleanup STOP_IO bitflag set when stop io

There should be no difference between setting the CF_IO_STOP flag
before or after restore_callbacks(). The restore_callbacks() call makes
sure that no callback is executed anymore when the bit wasn't set.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# f8bce79d 29-May-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't check othercon twice

This patch removes a redundant check of whether con->othercon is set
inside a branch which already checks that.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# f68bb23c 29-May-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix missing pending to false

This patch sets the process_dlm_messages_pending boolean to false when
there was no message to process. This is a case which should not happen,
but if it does we are prepared to recover from it by setting the pending
boolean to false.

Cc: stable@vger.kernel.org
Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2f2d9972 15-Mar-2023 Eric Dumazet <edumazet@google.com>

net: annotate lockless accesses to sk->sk_err_soft

This field can be read/written without lock synchronization.

tcp and dccp have been handled in different patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7386457a 06-Mar-2023 Edwin Török <edvin.torok@citrix.com>

DLM: increase socket backlog to avoid hangs with 16 nodes

On a 16 node virtual cluster with e1000 NICs joining the 12th node prints
SYN flood warnings for the DLM port:
Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters.

And then joining a DLM lockspace hangs:
```
Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds.
Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1
Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000
Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000
Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000
Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10
Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f
Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110
Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80
Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm]
Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100
Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm]
Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0
Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0
Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0
Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190
Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420
Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0
Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71
```

The previous limit of 5 seems like an arbitrary number, that doesn't match any
known DLM cluster size upper bound limit.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 00f30c05 06-Mar-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: add unbound flag to dlm_io workqueue

This patch adds the WQ_UNBOUND flag to the lowcomms dlm_io workqueue,
which does the socket io handling to send and receive dlm messages.
The number of sockets will be 2 for a 3 node cluster. Each socket has
two different workers doing send and receive work by calling socket
API functionality. Each worker does its task in order, to send dlm
messages over an ordered stream based socket communication. On the
receive side the receive buffer is queued up for an ordered dlm_process
workqueue to parse received dlm messages. The parsing currently needs to
be done in an ordered, synchronized way because the dlm message
processing is not made to parse in parallel.

To sum up those workqueue behaviours in lowcomms: the dlm_io workqueue
is only used for socket handling. Each socket has 2 workers (send and
receive). In a 3 node cluster we end up with 4 workers. Without the
WQ_UNBOUND flag the workers are tied to a CPU and can never switch,
which could be an advantage because of local CPU execution. However,
with the dlm_locktorture testcase I experienced that not all workers are
always in use, and my assumption is that some workers are bound to the
same CPU. We should always send or receive when we are ready to do so,
which is one reason why we disable the Nagle algorithm on sockets. We
should be safe to do the socket io handling on any CPU, which can be
switched during runtime. There is no assumption that the worker stays on
the same CPU. There is no need to respect any workqueue concurrency
model in which each worker can only run on one CPU. The lowcomms
queue_work() mechanism has a higher level flag to make sure that it
can't schedule work if the previous worker did not signal it, to keep
ordered socket handling. Therefore this patch sets the WQ_UNBOUND flag
to allow workers to be executed by any available CPU.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 54fbe0c1 12-Jan-2023 Alexander Aring <aahringo@redhat.com>

fs: dlm: bring back previous shutdown handling

This patch mostly reverts commit 4f567acb0b86 ("fs: dlm: remove socket
shutdown handling"). There can be situations where the dlm midcomms nodes
hash and lowcomms connection hash are not equal, but we need to guarantee
that the lowcomms are all closed on a last release of a dlm lockspace,
when a shutdown is invoked. This patch guarantees that we always close
all sockets managed by the lowcomms connection hash, and calls shutdown
for the last message sent. This ensures we don't cut the socket, which
could cause the peer to get a connection reset.

In future we should try to merge the midcomms/lowcomms hashes into one
hash and not handle both in separate hashes.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 40e0b090 19-Jan-2023 Peilin Ye <peilin.ye@bytedance.com>

net/sock: Introduce trace_sk_data_ready()

As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations. For example:

<...>
iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 98123866 16-Dec-2022 Benjamin Coddington <bcodding@redhat.com>

Treewide: Stop corrupting socket's task_frag

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag. The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code. I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false. Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 7a5e9f1f 22-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix building without lockdep

This patch uses assert_spin_locked() instead of lockdep_is_held()
where it's available to use because lockdep_is_held() is only available
if CONFIG_LOCKDEP is set.

In other cases like lockdep_sock_is_held() we surround it with a
CONFIG_LOCKDEP ifdef.

Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# dbb751ff 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: parallelize lowcomms socket handling

This patch is a rework of the lowcomms handling; the main goal is to
let recvmsg() and sendpage() run in parallel. Parallel in two senses:
1. per connection and 2. recvmsg()/sendpage() don't block each other.

Currently recvmsg()/sendpage() cannot run in parallel because the two
workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means
only one work item can be executed at a time. The number of queue items
grows with the number of nodes in the cluster. The current two
workqueues for sending and receiving can also block each other if the
same connection is executed at the same time in the dlm_recv and
dlm_send workqueues, because of a per connection mutex for the socket
handling.

To make it more parallel we introduce one "dlm_io" workqueue which is
not an ordered workqueue; the number of workers is not limited. Due to
the per connection flags SEND/RECV pending, we schedule workers ordered
per connection and per send and receive task. To get rid of the mutex
blocking workers of the same connection from doing socket handling, we
switched to a semaphore which handles socket operations as read locks
and sock releases as write locks, to prevent sock_release() being called
while the socket is being used.

There might be more optimization possible by removing the semaphore and
replacing it with another synchronization mechanism, however due to
other circumstances, e.g. the othercon behaviour, making this change
seems complicated. I added comments about removing the othercon handling
and moving to a different synchronization mechanism once this is done.
We need to do that with the next dlm major version upgrade because it is
not backwards compatible with the current connect mechanism.

The processing of dlm messages still needs to be handled by an ordered
workqueue. A dlm_process ordered workqueue was introduced which gets
filled by the receive worker. This is probably the next bottleneck of
DLM, but the application can't currently parse dlm messages in parallel.
A comment was introduced about lifting the dlm processing out of the
workqueue context into a non-sleepable softirq to get message processing
done fast.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
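
The per connection "pending" flag that keeps work ordered on an
unordered workqueue can be sketched with a C11 atomic_flag. This is a
userland illustration only: struct demo_con and the demo_* helpers are
hypothetical, and a counter stands in for the actual queue_work() call.

```c
#include <assert.h>
#include <stdatomic.h>

struct demo_con {
	atomic_flag send_pending;	/* set while a send worker is queued */
	int queued_send_works;		/* how often work was actually queued */
};

static void demo_con_init(struct demo_con *con)
{
	atomic_flag_clear(&con->send_pending);
	con->queued_send_works = 0;
}

/* Queue send work unless one is already pending for this connection.
 * Returns 1 if work was queued. */
static int demo_queue_send(struct demo_con *con)
{
	if (atomic_flag_test_and_set(&con->send_pending))
		return 0;
	con->queued_send_works++;
	return 1;
}

/* Called by the worker when it is done; allows the next one to be queued. */
static void demo_send_done(struct demo_con *con)
{
	atomic_flag_clear(&con->send_pending);
}
```

Each connection carries its own flag, so different connections can be
served in parallel while a single connection never has two send workers
queued at once. A second flag of the same kind would cover the receive
side.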


# 1351975a 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't init error value

This patch removes an unnecessary initialization of an error value to
-EINVAL.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c852a6d7 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: use saved sk_error_report()

This patch changes the handling of calling the original
sk_error_report() by not putting it on the stack and calling it later.
If the listen_sock.sk_error_report() is NULL in this moment it indicates
a bug in our implementation.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# e9dd5fd8 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: use sock2con without checking null

This patch removes null checks on private data for sockets. If we have a
null dereference there, we have a bug in our implementation that allows
such a callback to occur in this state.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 6f0b0b5d 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove dlm_node_addrs lookup list

This patch merges the dlm_node_addrs lookup list into the connection
structure. It is a per node mapping to some configuration set up by
configfs. We don't need two lookup structures. The connection hash now
has the same lifetime as the dlm_node_addrs entries. This means we add
new entries only when the cluster is configured and not while new
connections are coming in, remove a connection when a node got fenced,
and clean up all connections when dlm exits. It should work the same and
will even expose more issues, because we no longer try to somehow keep
those two data structures in sync with the current cluster
configuration.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c51c9cd8 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't put dlm_local_addrs on heap

This patch stops allocating the dlm_local_addr[] pointers on the heap.
Instead we directly store entries of type "struct sockaddr_storage".
This removes the function deinit_local() because it only freed memory.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
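
As a rough userland sketch of the data layout change, assuming
hypothetical names and an assumed table size: storing the addresses by
value in a static array means there is nothing to allocate and therefore
nothing to free on teardown.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#define DEMO_MAX_ADDR_COUNT 3	/* assumed limit, for illustration */

/* Addresses are stored by value; nothing here needs to be freed. */
static struct sockaddr_storage demo_local_addr[DEMO_MAX_ADDR_COUNT];
static int demo_local_count;

/* Copy one address into the next free slot. Returns 0 on success,
 * -1 if the table is full. */
static int demo_add_local_addr(const struct sockaddr_storage *addr)
{
	if (demo_local_count >= DEMO_MAX_ADDR_COUNT)
		return -1;
	memcpy(&demo_local_addr[demo_local_count++], addr, sizeof(*addr));
	return 0;
}
```

Resetting such a table only requires setting the count back to zero,
which is why a dedicated free function becomes unnecessary.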


# c3d88dfd 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: cleanup listen sock handling

This patch removes save_listen_callbacks() and add_listen_sock() as they
are only used once in the lowcomms functionality. For lowcomms shutdown
it's not necessary to flush the whole workqueues to synchronize with
restoring the old sk_data_ready() callback. Only the listen con receive
work needs to be cancelled. For each individual node shutdown we should
be sure that the last ack has been transmitted, which is done by
flushing the per connection swork worker.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4f567acb 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove socket shutdown handling

Since commit 489d8e559c65 ("fs: dlm: add reliable connection if
reconnect") we have functionality on the dlm application protocol layer
like what TCP offers for half-closed sockets. This feature is required
because cluster manager events about leaving resource memberships may
have already occurred locally while other cluster nodes still have a
pending leave membership in progress over the cluster manager protocol.
During this time the local dlm node has already shut down its connection
and doesn't transmit any new dlm messages anymore, but it still needs to
be able to accept dlm messages because of the pending leave membership
request of the cluster manager protocol, over which the dlm kernel
implementation has no control.

We have this functionality on the application layer for two reasons.
The main reason is that SCTP does not support such functionality on the
socket layer, but we can do it inside the application layer.

Another small issue is that this feature is broken in the TCP world
because some NAT devices do not implement such functionality
correctly. This is the same reason why the reliable connection session
layer in DLM exists. We give up on middle devices in the network
which e.g. send out TCP resets. In DLM we cannot have any message
dropping, and we ensure over a session layer that it can't happen.

Back to the half-closed graceful shutdown handling. It's not necessary
anymore to do it on the socket layer (which is only supported for TCP
sockets) because we do it on the application layer. This patch removes
this handling; if there are still issues then we have a problem in the
application layer handling.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1037c2a9 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: use listen sock as dlm running indicator

This patch switches the check of whether dlm lowcomms is running from
dlm_allow_conn to whether we actually have a listen socket set. The
listen socket is set and unset in the lowcomms start and shutdown
functionality. To synchronize with the data_ready() callback we set the
socket callback to NULL while the socket lock is held.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# dd070a56 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: use list_first_entry_or_null

Instead of checking list_empty() we can do the same with
list_first_entry_or_null() and return NULL if the returned value is NULL.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
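
The pattern can be illustrated in userland C. The list type and the
demo_first_entry_or_null() macro below are a minimal re-creation written
for this sketch, not the kernel's <linux/list.h>; struct demo_entry and
demo_get_first() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void demo_list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void demo_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Userland re-creation of the kernel helper, for illustration only. */
#define demo_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? \
	 (type *)((char *)(head)->next - offsetof(type, member)) : NULL)

struct demo_entry {
	int id;
	struct list_head list;
};

/* One call covers both the "empty list" check and the lookup of the
 * first entry; an empty list simply yields NULL. */
static struct demo_entry *demo_get_first(struct list_head *queue)
{
	return demo_first_entry_or_null(queue, struct demo_entry, list);
}
```

The caller only has to test the returned pointer, so the separate
list_empty() branch disappears.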


# 01ea3d77 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove twice INIT_WORK

This patch removes a duplicate INIT_WORK() call. We already do this
inside dlm_lowcomms_init(), which is called only once when dlm is
loaded.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 8b0188b0 17-Nov-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: add midcomms init/start functions

This patch introduces the remaining init, start, stop and exit
functionality. The dlm application layer should always call the midcomms
layer, which becomes aware of such events and redirects them to the
lowcomms layer. Some functionality which is currently handled inside the
start functionality of midcomms and lowcomms should be handled in the
init functionality, as it only needs to be initialized once when dlm is
loaded.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 194a3fb4 27-Oct-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: relax sending to allow receiving

This patch additionally drops the sock_mutex when there is a burst of
sent messages. Since we have acknowledge handling, we free sending
buffers only when we receive an ack back. But if we are stuck looping in
send_to_sock() because dlm sends a lot of messages and we never leave
the loop, the sending buffers fill up very quickly. We can't receive
during this iteration because the sock_mutex is held. This patch unlocks
the sock_mutex so that it is possible to receive messages while a burst
of sending messages happens. This allows memory to be freed because acks
which have already been received can be processed.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# f0f4bb43 27-Oct-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: retry accept() until -EAGAIN or error returns

This patch fixes a race if we get a socket data ready event twice
while the listen connection worker is queued. Currently it is served
only once, but we need to serve it repeatedly (in this case twice) until
we hit -EAGAIN, which tells us there is no pending accept left.

This patch wraps the accept in a do while loop that runs until we
receive a return value other than 0, as was done before commit
d11ccd451b65 ("fs: dlm: listen socket out of connection hash").

Cc: stable@vger.kernel.org
Fixes: d11ccd451b65 ("fs: dlm: listen socket out of connection hash")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
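
The loop shape described above can be sketched as follows. This is a
userland model with hypothetical names: demo_accept_one() stands in for
a non-blocking accept and the counters replace a real listen queue.

```c
#include <assert.h>
#include <errno.h>

static int demo_pending;	/* connections waiting in the listen queue */
static int demo_accepted;	/* connections accepted so far */

/* Stand-in for a non-blocking accept: 0 on success, -EAGAIN if the
 * listen queue is empty. */
static int demo_accept_one(void)
{
	if (demo_pending == 0)
		return -EAGAIN;
	demo_pending--;
	demo_accepted++;
	return 0;
}

/* Listen worker: a single run serves every pending connection, so two
 * data-ready events that collapse into one queued work are both handled. */
static int demo_listen_worker(void)
{
	int ret;

	do {
		ret = demo_accept_one();
	} while (ret == 0);

	return ret;
}
```

With two pending connections, one worker run accepts both and then
returns -EAGAIN; any other error would end the loop in the same way.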


# 08ae0547 27-Oct-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix sock release if listen fails

This patch fixes a double sock_release() call when listen() fails for
the dlm lowcomms listen socket. The caller of dlm_listen_for_all()
should never need to care about releasing the socket if
dlm_listen_for_all() fails; it is now done only once if listen() fails.

Cc: stable@vger.kernel.org
Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 30ea3257 15-Aug-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix race in lowcomms

This patch fixes a race between queue_work() in
_dlm_lowcomms_commit_msg() and srcu_read_unlock(). The queue_work() can
take the final reference of a dlm_msg and so msg->idx can contain
garbage which is signaled by the following warning:

[ 676.237050] ------------[ cut here ]------------
[ 676.237052] WARNING: CPU: 0 PID: 1060 at include/linux/srcu.h:189 dlm_lowcomms_commit_msg+0x41/0x50
[ 676.238945] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common iTCO_wdt iTCO_vendor_support qxl kvm_intel drm_ttm_helper vmw_vsock_virtio_transport kvm vmw_vsock_virtio_transport_common ttm irqbypass crc32_pclmul joydev crc32c_intel serio_raw drm_kms_helper vsock virtio_scsi virtio_console virtio_balloon snd_pcm drm syscopyarea sysfillrect sysimgblt snd_timer fb_sys_fops i2c_i801 lpc_ich snd i2c_smbus soundcore pcspkr
[ 676.244227] CPU: 0 PID: 1060 Comm: lock_torture_wr Not tainted 5.19.0-rc3+ #1546
[ 676.245216] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 676.246460] RIP: 0010:dlm_lowcomms_commit_msg+0x41/0x50
[ 676.247132] Code: fe ff ff ff 75 24 48 c7 c6 bd 0f 49 bb 48 c7 c7 38 7c 01 bd e8 00 e7 ca ff 89 de 48 c7 c7 60 78 01 bd e8 42 3d cd ff 5b 5d c3 <0f> 0b eb d8 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 676.249253] RSP: 0018:ffffa401c18ffc68 EFLAGS: 00010282
[ 676.249855] RAX: 0000000000000001 RBX: 00000000ffff8b76 RCX: 0000000000000006
[ 676.250713] RDX: 0000000000000000 RSI: ffffffffbccf3a10 RDI: ffffffffbcc7b62e
[ 676.251610] RBP: ffffa401c18ffc70 R08: 0000000000000001 R09: 0000000000000001
[ 676.252481] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000005
[ 676.253421] R13: ffff8b76786ec370 R14: ffff8b76786ec370 R15: ffff8b76786ec480
[ 676.254257] FS: 0000000000000000(0000) GS:ffff8b7777800000(0000) knlGS:0000000000000000
[ 676.255239] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 676.255897] CR2: 00005590205d88b8 CR3: 000000017656c003 CR4: 0000000000770ee0
[ 676.256734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 676.257567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 676.258397] PKRU: 55555554
[ 676.258729] Call Trace:
[ 676.259063] <TASK>
[ 676.259354] dlm_midcomms_commit_mhandle+0xcc/0x110
[ 676.259964] queue_bast+0x8b/0xb0
[ 676.260423] grant_pending_locks+0x166/0x1b0
[ 676.261007] _unlock_lock+0x75/0x90
[ 676.261469] unlock_lock.isra.57+0x62/0xa0
[ 676.262009] dlm_unlock+0x21e/0x330
[ 676.262457] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 676.263183] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 676.263815] ? preempt_count_sub+0xba/0x100
[ 676.264361] ? complete+0x1d/0x60
[ 676.264777] lock_torture_writer+0xb8/0x150 [dlm_locktorture]
[ 676.265555] kthread+0x10a/0x130
[ 676.266007] ? kthread_complete_and_exit+0x20/0x20
[ 676.266616] ret_from_fork+0x22/0x30
[ 676.267097] </TASK>
[ 676.267381] irq event stamp: 9579855
[ 676.267824] hardirqs last enabled at (9579863): [<ffffffffbb14e6f8>] __up_console_sem+0x58/0x60
[ 676.268896] hardirqs last disabled at (9579872): [<ffffffffbb14e6dd>] __up_console_sem+0x3d/0x60
[ 676.270008] softirqs last enabled at (9579798): [<ffffffffbc200349>] __do_softirq+0x349/0x4c7
[ 676.271438] softirqs last disabled at (9579897): [<ffffffffbb0d54c0>] irq_exit_rcu+0xb0/0xf0
[ 676.272796] ---[ end trace 0000000000000000 ]---

I reproduced this warning with the dlm_locktorture test, which is
currently not upstream. This patch fixes the issue by taking an
additional reference between dlm_lowcomms_new_msg() and
dlm_lowcomms_commit_msg(). In case of the race, the kref_put() in
dlm_lowcomms_commit_msg() will be the final put.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
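
The effect of the additional reference can be sketched in a simplified,
single-threaded userland model. All names are hypothetical, the counter
is a plain int rather than a kref, and the worker is called inline to
model the worst case where it wins the race.

```c
#include <assert.h>

struct demo_msg {
	int refcount;
	int idx;	/* stand-in for the saved srcu index */
	int released;	/* set instead of freeing, so the sketch can be inspected */
};

static void demo_put(struct demo_msg *msg)
{
	assert(msg->refcount > 0);
	if (--msg->refcount == 0)
		msg->released = 1;
}

/* new_msg: one reference for the send queue plus one extra reference
 * held by the caller until commit has finished. */
static void demo_new_msg(struct demo_msg *msg, int idx)
{
	msg->refcount = 2;
	msg->idx = idx;
	msg->released = 0;
}

/* Stand-in for the worker that may run right after queue_work() and
 * drop the queue's reference. */
static void demo_worker_sends(struct demo_msg *msg)
{
	demo_put(msg);
}

/* commit_msg: msg->idx is still valid after the worker ran because the
 * caller's reference is only dropped at the very end. */
static int demo_commit_msg(struct demo_msg *msg)
{
	int idx;

	demo_worker_sends(msg);		/* worst case: worker wins the race */
	assert(!msg->released);
	idx = msg->idx;			/* safe read, message still alive */
	demo_put(msg);			/* final put */
	return idx;
}
```

Without the extra reference taken in demo_new_msg(), the worker's put
would release the message before idx is read, which corresponds to the
garbage msg->idx behind the warning above.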


# dfc020f3 22-Jun-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix grammar in lowcomms output

This patch fixes some grammar in the lowcomms log output by removing
the word "successful". It should have been "successfully", but the
operation can never be unsuccessful, so we remove the word.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a8449f23 04-Apr-2022 Alexander Aring <aahringo@redhat.com>

dlm: add __CHECKER__ for false positives

This patch adds #ifndef __CHECKER__ guards for false-positive warnings
about imbalanced srcu lock/unlock handling, which are shown when
running sparse checks:

fs/dlm/midcomms.c:1065:20: warning: context imbalance in 'dlm_midcomms_get_mhandle' - wrong count at exit

Using __CHECKER__ will tell sparse to ignore these sections.

These imbalances are false positives because the upper layer is always
required to call the functions in sequence, e.g. if
dlm_midcomms_get_mhandle() is successful there must be a
dlm_midcomms_commit_mhandle() call afterwards.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1f4f1084 04-Apr-2022 Dan Carpenter <dan.carpenter@oracle.com>

dlm: uninitialized variable on error in dlm_listen_for_all()

The "sock" variable is not initialized on this error path.

Cc: stable@vger.kernel.org
Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# feae43f8 04-Jan-2022 Alexander Aring <aahringo@redhat.com>

fs: dlm: print cluster addr if non-cluster node connects

This patch prints the cluster node address if a non-cluster node
(according to the dlm config setting) tries to connect. The current
hexdump call prints at a different loglevel and is only available if
dynamic debug is enabled. Additionally, we use the IP address format
strings to print an IETF IPv4/IPv6 string representation.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# e4dc81ed 30-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: memory cache for lowcomms hotpath

This patch introduces a kmem cache for dlm_msg handles, which are
always used when dlm sends a message out, whether or not they are
covered by the midcomms layer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 3af2326c 30-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: memory cache for writequeue_entry

This patch introduces a kmem cache for writequeue entries. Writequeue
entries are allocated quite frequently when dlm transmits messages.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# be3b0400 30-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove wq_alloc mutex

This patch cleans up the code for allocating a new buffer in the dlm
writequeue mechanism. There was a tune-up to allow scheduling while a
new writequeue entry needs to be allocated because either no sending
page is available or the existing one is full. To avoid multiple
concurrent users checking at the same time whether an entry is available
or full, alloc_wq was introduced so that they wait while a new
writequeue entry is in the process of being queued; further users will
then check whether the newly allocated writequeue entry is full.

To simplify the code we remove this mutex and instead hold the already
introduced spin lock during the writequeue check, allocation and
queueing. Other users can then never check for available writequeue
entries while a new one is in process but not queued yet.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# bcbfea41 30-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: check for pending users filling buffers

Currently we don't care if the DLM application stack is filling buffers
(not committed yet) while we transmit some already committed buffers.
By checking for active writequeue users before dequeuing a writequeue
entry, we know that more data is coming and do nothing. We wait until
the send worker is triggered again when the writequeue entry users hit
zero.
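
The pending-users check can be sketched as a small userspace model. This
is a Python illustration under simplifying assumptions, not the kernel
implementation: the Entry and Conn classes and the send_triggers counter
are hypothetical, and a single entry collects all messages.

```python
from collections import deque


class Entry:
    def __init__(self):
        self.users = 0      # writers still filling this entry
        self.data = []      # payloads placed into this entry


class Conn:
    def __init__(self):
        self.writequeue = deque()
        self.sent = []
        self.send_triggers = 0

    def new_msg(self, payload):
        if not self.writequeue:
            self.writequeue.append(Entry())
        entry = self.writequeue[-1]
        entry.users += 1            # buffer handed out, not committed yet
        entry.data.append(payload)
        return entry

    def commit_msg(self, entry):
        entry.users -= 1
        if entry.users == 0:
            self.send_triggers += 1  # would queue the send worker here

    def send_worker(self):
        while self.writequeue:
            entry = self.writequeue[0]
            if entry.users > 0:
                return               # more data is coming, do nothing
            self.sent.extend(entry.data)
            self.writequeue.popleft()
```

With two writers on the same entry, a send attempt after the first
commit transmits nothing. The entry goes out once the second commit
drops the user count to zero and triggers the worker again.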

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1b9beda8 17-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix build with CONFIG_IPV6 disabled

This patch surrounds the AF_INET6 case in sk_error_report() of dlm
with an #if IS_ENABLED(CONFIG_IPV6). The field sk->sk_v6_daddr is not
defined when CONFIG_IPV6 is disabled. If CONFIG_IPV6 is disabled,
socket creation with AF_INET6 should already fail because of a runtime
check whether AF_INET6 is registered. However, should AF_INET6 still be
set as sk_family, the sk_error_report() callback will then print an
invalid family type error.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 4c3d90570bcc ("fs: dlm: don't call kernel_getpeername() in error_report()")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 92c44605 15-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: replace use of socket sk_callback_lock with sock_lock

This patch replaces the use of the socket sk_callback_lock with the
socket lock. Other users such as sunrpc, see commit ea9afca88bbe
("SUNRPC: Replace use of socket sk_callback_lock with sock_lock"), moved
from sk_callback_lock to sock_lock, which seems to be held when the
socket callbacks are called.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4c3d9057 15-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't call kernel_getpeername() in error_report()

In some cases kernel_getpeername() will take the socket lock, which is
already held when the socket layer calls the error_report() callback.
Since commit 9dfc685e0262 ("inet: remove races in inet{6}_getname()")
this problem has become more likely because the socket lock is now
always taken. You will see something like:

bob9-u5 login: [ 562.316860] BUG: spinlock recursion on CPU#7, swapper/7/0
[ 562.318562] lock: 0xffff8f2284720088, .magic: dead4ead, .owner: swapper/7/0, .owner_cpu: 7
[ 562.319522] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.15.0+ #135
[ 562.320346] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[ 562.321277] Call Trace:
[ 562.321529] <IRQ>
[ 562.321734] dump_stack_lvl+0x33/0x42
[ 562.322282] do_raw_spin_lock+0x8b/0xc0
[ 562.322674] lock_sock_nested+0x1e/0x50
[ 562.323057] inet_getname+0x39/0x110
[ 562.323425] ? sock_def_readable+0x80/0x80
[ 562.323838] lowcomms_error_report+0x63/0x260 [dlm]
[ 562.324338] ? wait_for_completion_interruptible_timeout+0xd2/0x120
[ 562.324949] ? lock_timer_base+0x67/0x80
[ 562.325330] ? do_raw_spin_unlock+0x49/0xc0
[ 562.325735] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ 562.326218] ? del_timer+0x54/0x80
[ 562.326549] sk_error_report+0x12/0x70
[ 562.326919] tcp_validate_incoming+0x3c8/0x530
[ 562.327347] ? kvm_clock_read+0x14/0x30
[ 562.327718] ? ktime_get+0x3b/0xa0
[ 562.328055] tcp_rcv_established+0x121/0x660
[ 562.328466] tcp_v4_do_rcv+0x132/0x260
[ 562.328835] tcp_v4_rcv+0xcea/0xe20
[ 562.329173] ip_protocol_deliver_rcu+0x35/0x1f0
[ 562.329615] ip_local_deliver_finish+0x54/0x60
[ 562.330050] ip_local_deliver+0xf7/0x110
[ 562.330431] ? inet_rtm_getroute+0x211/0x840
[ 562.330848] ? ip_protocol_deliver_rcu+0x1f0/0x1f0
[ 562.331310] ip_rcv+0xe1/0xf0
[ 562.331603] ? ip_local_deliver+0x110/0x110
[ 562.332011] __netif_receive_skb_core+0x46a/0x1040
[ 562.332476] ? inet_gro_receive+0x263/0x2e0
[ 562.332885] __netif_receive_skb_list_core+0x13b/0x2c0
[ 562.333383] netif_receive_skb_list_internal+0x1c8/0x2f0
[ 562.333896] ? update_load_avg+0x7e/0x5e0
[ 562.334285] gro_normal_list.part.149+0x19/0x40
[ 562.334722] napi_complete_done+0x67/0x160
[ 562.335134] virtnet_poll+0x2ad/0x408 [virtio_net]
[ 562.335644] __napi_poll+0x28/0x140
[ 562.336012] net_rx_action+0x23d/0x300
[ 562.336414] __do_softirq+0xf2/0x2ea
[ 562.336803] irq_exit_rcu+0xc1/0xf0
[ 562.337173] common_interrupt+0xb9/0xd0

It is and always was forbidden to call kernel_getpeername() in the
context of error_report(). To get rid of the problem we access the
destination address of the peer through the socket structure. While at
it, we also fix the printout of the destination port of the inet socket.

Fixes: 1a31833d085a ("DLM: Replace nodeid_to_addr with kernel_getpeername")
Reported-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b87b1883 03-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove double list_first_entry call

This patch removes a list_first_entry() call which is already done by
the previous con_next_wq() call.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5c16febb 02-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: let handle callback data as void

This patch changes the dlm_lowcomms_new_msg() function pointer private
data from "struct mhandle *" to "void *" to allow structures other than
just "struct mhandle".

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 92732376 02-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: trace socket handling

This patch adds tracepoints for dlm socket receive and send
functionality. We can use them to track how much data was sent to or
received from a specific nodeid.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# fe933675 02-Nov-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove check SCTP is loaded message

Since commit 764ff4011424 ("fs: dlm: auto load sctp module") we try to
load the sctp module before we try to create an sctp kernel socket. If
socket creation fails now, it most likely has other reasons. This patch
removes the part of the error message about loading the sctp module and
prints the error code instead.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b97f8525 18-Aug-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: implement delayed ack handling

With this patch we no longer ack each message. Lowcomms takes care of
sending an ack back after a bulk of messages has been processed.
Currently this happens only when the whole receive buffer has been
processed; there might be better points at which to send an ack, but
only the lowcomms implementation knows when there is more data to
receive. This patch also has the disadvantage that we might retransmit
more on errors, however this is a very rare case.
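
The difference between per-message and delayed acks can be sketched as
follows. This is a Python illustration only: process_receive_buffers()
is a hypothetical helper, and it assumes that an ack carries the number
of messages processed so far, which the text above does not spell out.

```python
def process_receive_buffers(buffers):
    """Return the acks that would be sent, one per processed buffer.

    Each ack value is the count of messages processed so far
    (assumption: acks are cumulative).
    """
    acks = []
    processed = 0
    for buf in buffers:
        for _msg in buf:
            processed += 1          # message handled, no ack yet
        if buf:
            acks.append(processed)  # single ack for the whole buffer
    return acks
```

For two receive buffers holding three messages and one message, this
model sends two acks instead of four.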

Tested with make_panic on gfs2 with three nodes by running:

trace-cmd record -p function -l 'dlm_send_ack' sleep 100

and

trace-cmd report | wc -l

Before patch:
- 20548
- 21376
- 21398

After patch:
- 18338
- 20679
- 19949

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 62699b3f 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: move receive loop into receive handler

This patch moves the kernel_recvmsg() loop into the
receive_from_sock() function instead of doing the loop outside the
function and aborting the loop based on its return value.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c51b0221 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix multiple empty writequeue alloc

This patch adds a mutex so that, in sleepable context, only one caller
at a time can allocate a writequeue entry buffer for a connection. If
multiple callers wait on the writequeue spinlock and the spinlock gets
released, multiple new writequeue page buffers could be allocated
instead of allocating one writequeue page buffer and letting the other
waiters use the remaining space in it. This only applies to sleepable
context, which is the common case. In non-sleepable contexts like
retransmission we just don't care about such behaviour.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 8728a455 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: generic connect func

This patch adds a generic connect function for TCP and SCTP. Where the
connect functionality differs between the two, additional callbacks
were added to dlm_proto_ops. The sockopts callback handling guarantees
that sockets created by connect() use the same options as sockets
created by accept().

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 90d21fc0 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: auto load sctp module

This patch adds a "for now" better handling of missing SCTP support in
the kernel and tries to load the sctp module if SCTP is set.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2dc6b115 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: introduce generic listen

This patch combines each transport layer listen functionality into one
listen function. Per transport layer differences are provided by
additional callbacks in dlm_proto_ops.

This patch silently drops sock_set_keepalive() for listening tcp sockets
only. This socket option is not set on connecting sockets, and I also
don't see the point of setting keepalive only for sockets which are
created by accept().

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a66c008c 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: move to static proto ops

This patch moves the per transport socket callbacks to a static const
array. We can support only one transport socket for the init namespace,
which is determined by reading the dlm config at lowcomms_start().

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 66d5955a 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: introduce con_next_wq helper

This patch introduces a function to determine whether something in the
writequeue is ready to be sent. It is not enough that the writequeue is
non-empty; additionally, the first entry needs to have a valid length
field.
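
The readiness check can be sketched as a small userspace model. This is
a Python illustration, not the kernel helper: the Entry tuple is
hypothetical, and "valid length" is assumed here to mean a non-zero
length.

```python
from collections import namedtuple

# Hypothetical stand-in for a writequeue entry; only the length matters.
Entry = namedtuple("Entry", ["len"])


def con_next_wq(writequeue):
    """Return the first entry if it is ready to be sent, else None."""
    if not writequeue:
        return None          # nothing queued at all
    entry = writequeue[0]
    if entry.len == 0:
        return None          # assumption: zero length means not ready
    return entry
```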

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 052849be 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: clear CF_APP_LIMITED on close

If send_to_sock() sets the CF_APP_LIMITED bit and it has not yet been
cleared by a waiting lowcomms_write_space() when close_connection()
is called, we should clear the CF_APP_LIMITED bit again because the
connection starts from a fresh state at reconnect.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# feb704bd 16-Jul-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: use sk->sk_socket instead of con->sock

Instead of dereferencing "con->sock" we can get the socket structure via
"sk->sk_socket" as well. This patch switches to this behaviour.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# d10a0b88 02-Jun-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: rename socket and app buffer defines

This patch renames DEFAULT_BUFFER_SIZE to DLM_MAX_SOCKET_BUFSIZE and
LOWCOMMS_MAX_TX_BUFFER_LEN to DLM_MAX_APP_BUFSIZE as they are proper
names to define what's behind those values. The DLM_MAX_SOCKET_BUFSIZE
defines the maximum size of buffer which can be handled on socket layer,
the DLM_MAX_APP_BUFSIZE defines the maximum size of buffer which can be
handled by the DLM application layer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# ac7d5d03 02-Jun-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: introduce proto values

Currently the dlm protocol values are that TCP is 0 and everything else
is SCTP. This makes it difficult to introduce other possible transport
layers. The only user space tool I am aware of, dlm_controld, uses the
protocol value 1 for SCTP. We now change the kernel to handle SCTP as 1.
This will break the user space API, but it fixes it so that we can add
other possible transport layers.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9a4139a7 02-Jun-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: move dlm allow conn

This patch checks whether new connections are allowed before queueing
the listen socket to accept new connections.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 6c6a1cc6 02-Jun-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: use alloc_ordered_workqueue

The proper way to allocate ordered workqueues is to use the
alloc_ordered_workqueue() function. The current way implies an ordered
workqueue, which is also what dlm requires.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# fcef0e6c 02-Jun-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix lowcomms_start error case

This patch fixes the error path handling in lowcomms_start(). We need to
clean up some statically allocated data structures and clean up the
workqueues if they have been started.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 706474fb 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't allow half transmitted messages

This patch cleans a dirty page buffer if a reconnect occurs. If a page
buffer was half transmitted, we cannot start in the middle of a dlm
message when a node connects again. I observed invalid length reception
errors and guessed that this behaviour was the cause; after this patch
I never saw an invalid message length again. This patch might drop more
messages for dlm version 3.1, but 3.1 can't deal with half messages
either. For 3.2 it might trigger more re-transmissions, but it will not
leave dlm in a broken state.
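
The idea of discarding a half transmitted buffer can be sketched as a
small userspace model. This is a Python illustration under stated
assumptions, not the kernel code: the Entry class, its offset field and
drop_dirty_head() are hypothetical, and "dirty" is assumed to mean that
some bytes of the head entry were already written to the socket.

```python
class Entry:
    def __init__(self, data):
        self.data = data
        self.offset = 0          # bytes already written to the socket

    @property
    def dirty(self):
        return self.offset > 0   # partially transmitted


def drop_dirty_head(writequeue):
    """Discard a partially sent head entry before reconnecting.

    Returns True if an entry was dropped.
    """
    if writequeue and writequeue[0].dirty:
        writequeue.pop(0)
        return True
    return False
```

After a reconnect the next transmission then starts at a message
boundary instead of in the middle of a message.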

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 489d8e55 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: add reliable connection if reconnect

This patch makes a tcp lowcomms connection reliable even if reconnects
occur. This is done by application layer re-transmission handling and
sequence numbers in the dlm protocol. There are three new dlm
commands:

DLM_OPTS:

This encapsulates an existing dlm message (and rcom message if it
doesn't have its own application side re-transmission handling).
Optionally, additional tlv's (type length fields) can be appended, for
example a sequence number field. However, because the lockspace field is
unused in DLM_OPTS and a sequence number is mandatory, it isn't made a
tlv; instead we put the sequence number inside the lockspace id. The
possibility to add optional options is still there for future
purposes.

DLM_ACK:

Just a dlm header to acknowledge the receipt of a DLM_OPTS message to
its sender.

DLM_FIN:

This provides a 4-way handshake for connection termination, including
support for half-closed connections. It is provided on the application
layer because SCTP doesn't support half-closed sockets, the shutdown()
call can be interrupted by e.g. TCP resets, and it is hard to implement
because of the othercon paradigm in lowcomms. The 4-way termination
handshake also solves the problem of synchronizing peer EOF arrival with
the cluster manager removing the peer in the node membership handling of
DLM. In some cases messages can still be transmitted during this time
and we need to wait for the node membership event.

To provide a reliable connection the node will retransmit all
unacknowledged messages to its peer on reconnect. The receiver will then
filter the incoming messages and drop all those which are duplicates.
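
The retransmit and duplicate filtering described above can be sketched
as a userspace model. This is a Python illustration only, not the
midcomms implementation: the Sender and Receiver classes are
hypothetical, acks are assumed to be cumulative (carrying the next
expected sequence number), and sequence number wrap-around is ignored.

```python
class Sender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = []            # (seq, payload) awaiting an ack

    def send(self, payload):
        frame = (self.next_seq, payload)
        self.next_seq += 1
        self.unacked.append(frame)
        return frame

    def ack(self, seq):
        # Assumption: everything below seq has been received.
        self.unacked = [f for f in self.unacked if f[0] >= seq]

    def on_reconnect(self):
        return list(self.unacked)    # retransmit all unacked frames


class Receiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, frame):
        seq, payload = frame
        if seq != self.expected:
            return False             # duplicate or unexpected: drop
        self.delivered.append(payload)
        self.expected += 1
        return True
```

If the connection breaks after the receiver has seen two of three
messages but the sender only got an ack for the first, the reconnect
resends the second and third. The receiver drops the second as a
duplicate and delivers the third.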

As RCOM_STATUS and RCOM_NAMES messages are the first messages which are
exchanged and they have their own re-transmission handling, there is
logic to ensure that these messages come first. When these messages
arrive we store the dlm version field. This handling is the same on DLM
3.1 and, after this patch, 3.2. A backwards compatibility handling has
been added which seems to work in tests without tcpkill; however, it's
not recommended to use DLM 3.1 and 3.2 at the same time, because DLM 3.2
tries to fix long-standing bugs in the DLM protocol.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 37a247da 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: move out some hash functionality

This patch moves out some lowcomms hash functionality into lowcomms
header to provide them to other layers like midcomms as well.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2874d1a6 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: add functionality to re-transmit a message

This patch introduces retransmit functionality for a lowcomms message
handle. It just allocates a new buffer and transmits it again, with no
special handling to prioritize it because the bytestream must be kept in
order.

To avoid another connection lookup, some refactoring was done to allow a
new buffer allocation with a preexisting connection pointer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 8f2dc78d 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: make buffer handling per msg

This patch makes the void pointer handle for lowcomms functionality per
message and not per page allocation entry. Refcount handling for the
handle was added to keep the message alive until the user doesn't need
it anymore.

There is now a per message callback which is called when allocating a
new buffer. This callback is guaranteed to be called in the order of the
sending buffer, so the caller can use it to increment a sequence number
for the dlm message handle.

For the transition we cast the dlm_mhandle to dlm_msg and vice versa
until the midcomms layer implements a specific dlm_mhandle structure.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 8aa31cbf 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix connection tcp EOF handling

This patch fixes the EOF handling for TCP so that if an EOF is received
we close the socket the next time the writequeue runs empty. This is
half-closed socket functionality which doesn't exist in SCTP. The
midcomms layer will implement half-closed socket functionality on the
DLM side to solve this problem for the SCTP case. However, there is
still the last ack in flight, but other reset functionality will take
care of it if it gets lost.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c6aa00e3 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: cancel work sync othercon

These rx/tx flag arguments signal to close_connection() from which
worker it is called. Obviously the receive worker cannot cancel itself,
and the same goes for swork. For the othercon only the receive worker
should be used; however, to avoid deadlocks we should pass the same
flags with which the original close_connection() was called.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# ba868d9d 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: reconnect if socket error report occurs

This patch changes the reconnect handling so that a reconnect is
triggered when the socket error callback is invoked. This also handles
reconnects in the non-blocking connecting case, which is currently
missing. If error ECONNREFUSED is reported we delay the reconnect by one
second.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 7443bc96 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: set is othercon flag

There is an "is othercon" flag which is never used. This patch sets it
and prints a warning if the othercon ever sends a dlm message, which
should never be the case.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b38bc9c2 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix srcu read lock usage

This patch holds the srcu connection read lock in cases where we look up
connections and access them. We don't hold the srcu lock in worker
functions where the scheduled worker is part of the connection itself.
The connection should not be freed if any worker is scheduled or
pending.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2df6b762 21-May-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: add dlm macros for ratelimit log

This patch adds ratelimit macros to the dlm subsystem and ratelimits the
connecting log message. In non-blocking connecting cases this message is
printed very often.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2fd8db2d 27-Mar-2021 Yang Yingliang <yangyingliang@huawei.com>

fs: dlm: fix missing unlock on error in accept_from_sock()

Add the missing unlock before return from accept_from_sock()
in the error handling case.

Fixes: 6cde210a9758 ("fs: dlm: add helper for init connection")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9d232469 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: add shutdown hook

This patch fixes issues which occur when dlm lowcomms synchronizes its
workqueues but the dlm application layer has already released the
lockspace. In such cases messages like:

dlm: gfs2: release_lockspace final free
dlm: invalid lockspace 3841231384 from 1 cmd 1 type 11

are printed to the kernel log. This patch solves this issue by
introducing a new "shutdown" hook which is called before the "stop" hook
when the lockspace is finally going to be released. This should prevent
any dlm messages from sitting in the workqueues during or after
lockspace removal.

It's necessary to call dlm_scand_stop(), as I instrumented the
dlm_lowcomms_get_buffer() code to report a warning when it is called
after the dlm_midcomms_shutdown() functionality, see below:

WARNING: CPU: 1 PID: 3794 at fs/dlm/midcomms.c:1003 dlm_midcomms_get_buffer+0x167/0x180
Modules linked in: joydev iTCO_wdt intel_pmc_bxt iTCO_vendor_support drm_ttm_helper ttm pcspkr serio_raw i2c_i801 i2c_smbus drm_kms_helper virtio_scsi lpc_ich virtio_balloon virtio_console xhci_pci xhci_pci_renesas cec qemu_fw_cfg drm [last unloaded: qxl]
CPU: 1 PID: 3794 Comm: dlm_scand Tainted: G W 5.11.0+ #26
Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
RIP: 0010:dlm_midcomms_get_buffer+0x167/0x180
Code: 5d 41 5c 41 5d 41 5e 41 5f c3 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d 41 5e 41 5f c3 4c 89 e7 45 31 e4 e8 3b f1 ec ff eb 86 <0f> 0b 4c 89 e7 45 31 e4 e8 2c f1 ec ff e9 74 ff ff ff 0f 1f 80 00
RSP: 0018:ffffa81503f8fe60 EFLAGS: 00010202
RAX: 0000000000000008 RBX: ffff8f969827f200 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffffad1e89a0 RDI: ffff8f96a5294160
RBP: 0000000000000001 R08: 0000000000000000 R09: ffff8f96a250bc60
R10: 00000000000045d3 R11: 0000000000000000 R12: ffff8f96a250bc60
R13: ffffa81503f8fec8 R14: 0000000000000070 R15: 0000000000000c40
FS: 0000000000000000(0000) GS:ffff8f96fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055aa3351c000 CR3: 000000010bf22000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
dlm_scan_rsbs+0x420/0x670
? dlm_uevent+0x20/0x20
dlm_scand+0xbf/0xe0
kthread+0x13a/0x150
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30

To synchronize all dlm scand messages we stop it right before the
shutdown hook.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# eec054b5 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: flush swork on shutdown

This patch fixes the flushing of send work before shutdown. The function
cancel_work_sync() is not the right workqueue functionality to use here
as it would cancel the work if the work queues itself again. In case of
EAGAIN in send() for a dlm message we need to be sure that everything
has been sent out beforehand. The function flush_work() will ensure that
all send work is done, including in EAGAIN cases.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# f0747ebf 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: simplify writequeue handling

This patch cleans up the current dlm sending allocator handling by using
some named macros, list functionality and removes some goto statements.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# e1a7cbce 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: use GFP_ZERO for page buffer

This patch uses GFP_ZERO to allocate the page for the internal dlm
sending buffer allocator instead of calling memset zero after every
allocation. Already allocated space is never reused.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c45674fb 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: change allocation limits

While running tcpkill I experienced invalid header length values on
receive. To check that a node doesn't try to send an invalid dlm
message, we also check against the application's minimum allocation
limit. Also use DEFAULT_BUFFER_SIZE as the maximum allocation limit. The
define LOWCOMMS_MAX_TX_BUFFER_LEN is used to calculate maximum buffer
limits on the application layer; the future midcomms layer will subtract
its needs from this define.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 51746163 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: add check if dlm is currently running

This patch adds checks for dlm config attributes regarding protocol
parameters, as it only makes sense to change them when dlm is not
running. It also adds a check for valid protocol specifiers and returns
invalid argument if they are not supported.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# e9a470ac 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: set subclass for othercon sock_mutex

This patch sets the lockdep subclass for the othercon socket mutex. In
various places the connection socket mutex is held while locking the
othercon socket mutex. This patch removes the lockdep warnings that
appear in such cases.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b30a624f 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: set connected bit after accept

This patch sets the CF_CONNECTED bit when dlm accepts a connection from
another node. If we don't set this bit, the next time the connection
socket becomes writable it will be treated as an event signalling that
the connection was successfully established. However, that is only the
case when the connection did a connect.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# e125fbeb 01-Mar-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix mark setting deadlock

This patch fixes a deadlock issue when dlm_lowcomms_close() is called.
When dlm_lowcomms_close() is called, clusters_root.subsys.su_mutex is
held to remove configfs items. At this time we flush (e.g. via
cancel_work_sync()) the workers of the send and recv workqueues. Because
these workers access configfs items (mark values), they will lock
clusters_root.subsys.su_mutex as well, which is already held by
dlm_lowcomms_close(), and this ends in a deadlock.

[67170.703046] ======================================================
[67170.703965] WARNING: possible circular locking dependency detected
[67170.704758] 5.11.0-rc4+ #22 Tainted: G W
[67170.705433] ------------------------------------------------------
[67170.706228] dlm_controld/280 is trying to acquire lock:
[67170.706915] ffff9f2f475a6948 ((wq_completion)dlm_recv){+.+.}-{0:0}, at: __flush_work+0x203/0x4c0
[67170.708026]
but task is already holding lock:
[67170.708758] ffffffffa132f878 (&clusters_root.subsys.su_mutex){+.+.}-{3:3}, at: configfs_rmdir+0x29b/0x310
[67170.710016]
which lock already depends on the new lock.

The new behaviour adds the mark value to the node address configuration,
so mark values are accessed in a separate data structure that doesn't
require holding clusters_root.subsys.su_mutex. However, the mark
values can now be set only after a node address was set, which is the
case when the user is using dlm_controld.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4f19d071 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: check on existing node address

This patch checks whether the same address is added twice to a per-node
address array. This should never be the case, and we report -EEXIST to
user space.
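
The duplicate check described above can be sketched in user-space C. This
is only an illustration, not the kernel code: node_add_addr(), struct
node_addrs, MAX_NODE_ADDRS, the IPv4-only addresses and the -ENOSPC
return for a full array are all assumptions made for the example.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_NODE_ADDRS 3 /* hypothetical per-node limit */

struct node_addrs {
	int count;                      /* number of valid entries in addr[] */
	uint32_t addr[MAX_NODE_ADDRS];  /* simplified: IPv4 addresses only */
};

/*
 * Add an address to the per-node array. Returns 0 on success, -EEXIST
 * if the address is already present, -ENOSPC if the array is full.
 */
static int node_add_addr(struct node_addrs *na, uint32_t addr)
{
	int i;

	/* Reject an address that was already configured for this node. */
	for (i = 0; i < na->count; i++) {
		if (na->addr[i] == addr)
			return -EEXIST;
	}

	if (na->count >= MAX_NODE_ADDRS)
		return -ENOSPC;

	na->addr[na->count++] = addr;
	return 0;
}
```

Adding the same value twice leaves the array unchanged and returns
-EEXIST on the second call.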

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 40c6b83e 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: constify addr_compare

This patch just constifies some function parameters which should have
read-only access.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1a26bfaf 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix check for multi-homed hosts

This patch will use the runtime array size dlm_local_count variable
to check the actual size of the dlm_local_addr array. There is
currently a cleanup bug, because the tcp_listen_for_all() functionality
might check a dangling pointer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# d11ccd45 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: listen socket out of connection hash

This patch introduces a separate connection structure for the listen
socket instead of handling the listen socket as a normal connection
structure in the connection hash. We can remove some nodeid equals zero
validation checks, because this nodeid should no longer exist inside
the node hash. This patch also removes the sock mutex in the
accept_from_sock() function because this function can't run in another
parallel context if it's scheduled on only one workqueue.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 13004e8a 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: refactor sctp sock parameter

This patch refactors sctp_bind_addrs() to work with a socket parameter
instead of a connection parameter.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 42873c90 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: move shutdown action to node creation

This patch moves the assignment of the shutdown action callback to the
node creation functionality.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0672c3c2 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: move connect callback in node creation

This patch moves the assignment of the connect callback to the node
creation instead of assigning some dummy functionality. Which connect
functionality is used is determined according to the configfs setting.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 6cde210a 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: add helper for init connection

This patch moves the connection structure initialization into its
own function. This avoids having to update the othercon initialization
separately.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 19633c7e 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: handle non blocked connect event

The connect manpage shows that in non-blocking mode, writability
indicates a successful connection event. This patch handles this event
inside the writability callback. In the case of SCTP we use the blocking
connect functionality, which indicates a successful connect when the
function returns with a successful return value.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 53a5edaa 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: flush othercon at close

This patch ensures we also flush the othercon writequeue when a lowcomms
close occurs.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 692f51c8 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: add get buffer error handling

This patch adds error handling to the get buffer functionality for the
case where the user requests a buffer length larger than the internal
buffer allocator can provide. This should never happen because the
relevant limits are decided at compile time, but it will warn if
somebody forgets to handle this limitation correctly.
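
A minimal user-space sketch of such a length check is shown here. The
names get_buffer() and MAX_TX_BUFFER_LEN and the use of malloc() and
stderr are assumptions for illustration only; the kernel code uses its
own allocator and warning facilities.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_TX_BUFFER_LEN 4096 /* hypothetical allocator limit */

/*
 * Hand out a buffer of len bytes, or warn and return NULL when the
 * request exceeds what the allocator can provide.
 */
static void *get_buffer(size_t len)
{
	if (len > MAX_TX_BUFFER_LEN) {
		fprintf(stderr, "get_buffer: length %zu exceeds limit %d\n",
			len, MAX_TX_BUFFER_LEN);
		return NULL;
	}

	return malloc(len);
}
```

Callers are expected to check for NULL instead of assuming that the
compile-time limits were respected.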

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5cbec208 02-Nov-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix proper srcu api call

This patch will use call_srcu() instead of call_rcu() because the
related data structure resources are handled under srcu context. I assume
the current code is fine anyway, since free_conn() must only be called
when the related resources are no longer in use. However, it will correct
the overall handling in a srcu context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4f2b30fd 30-Sep-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix race in nodeid2con

This patch fixes a race in nodeid2con in cases where two lookups run in
parallel and both create a connection structure for the same nodeid.
Creating a new connection structure is a rare case, so to keep readers
lockless we just do the lookup again inside the protected area and drop
the previous work if this race happens.
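
The lookup, allocate, lookup-again pattern can be sketched in user-space
C as follows. This is an assumption-laden illustration, not the kernel
implementation: the kernel protects readers with SRCU, whereas this
sketch uses C11 atomics and a pthread mutex, never removes entries, and
the names struct conn, find_conn() and nodeid_to_conn() are made up.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define HASH_SIZE 8

struct conn {
	int nodeid;
	struct conn *next; /* set before publication, never changed after */
};

static _Atomic(struct conn *) conn_hash[HASH_SIZE];
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lockless reader: walks one bucket without taking conn_lock. */
static struct conn *find_conn(int nodeid)
{
	unsigned int b = (unsigned int)nodeid % HASH_SIZE;
	struct conn *c;

	c = atomic_load_explicit(&conn_hash[b], memory_order_acquire);
	for (; c; c = c->next) {
		if (c->nodeid == nodeid)
			return c;
	}
	return NULL;
}

/* Return the connection for nodeid, creating it if needed. */
static struct conn *nodeid_to_conn(int nodeid)
{
	unsigned int b = (unsigned int)nodeid % HASH_SIZE;
	struct conn *c, *other;

	c = find_conn(nodeid); /* fast path, no lock */
	if (c)
		return c;

	c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;
	c->nodeid = nodeid;

	pthread_mutex_lock(&conn_lock);
	/* Look again: another thread may have inserted the same nodeid. */
	other = find_conn(nodeid);
	if (other) {
		pthread_mutex_unlock(&conn_lock);
		free(c); /* drop our work, we lost the race */
		return other;
	}
	c->next = atomic_load_explicit(&conn_hash[b], memory_order_relaxed);
	atomic_store_explicit(&conn_hash[b], c, memory_order_release);
	pthread_mutex_unlock(&conn_lock);

	return c;
}
```

Only the rare creation path pays for the lock; a lookup that finds an
existing entry never touches it.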

Fixes: a47666eb763cc ("fs: dlm: make connection hash lockless")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4798cbbf 24-Sep-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: rework receive handling

This patch reworks the current receive handling of dlm. As I tried to
change the send handling to fix reorder issues, I took a look into the
receive handling and simplified it. It works as follows:

Each connection has a preallocated receive buffer with a minimum length of
4096. On receive, the upper layer protocol will process all dlm messages
until there is not enough data anymore. If there is "leftover" data at
the end of the receive buffer because a dlm message wasn't fully received,
it will be copied to the beginning of the preallocated receive buffer. On
the next receive, more data will be appended to the previous "leftover"
data and processing will begin again.
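
The leftover handling described above can be sketched in user-space C.
Everything specific here is an assumption for illustration: the 2-byte
little-endian length header is a made-up framing (the real dlm header
differs), and struct rx_state, msg_len() and rx_feed() are hypothetical
names.

```c
#include <stddef.h>
#include <string.h>

#define RX_BUF_LEN 4096

struct rx_state {
	unsigned char buf[RX_BUF_LEN];
	size_t used;        /* valid bytes in buf, leftover included */
	unsigned processed; /* number of complete messages handled */
};

/* Hypothetical framing: 2-byte little-endian total length, header included. */
static size_t msg_len(const unsigned char *p)
{
	return (size_t)p[0] | ((size_t)p[1] << 8);
}

/*
 * Append received bytes behind any leftover, process every complete
 * message, then move the unprocessed tail to the start of the buffer.
 * Returns 0 on success, -1 if the data does not fit or a header is
 * invalid (the caller is then expected to reset the state).
 */
static int rx_feed(struct rx_state *rx, const unsigned char *data, size_t len)
{
	size_t off = 0;

	if (len > RX_BUF_LEN - rx->used)
		return -1;
	memcpy(rx->buf + rx->used, data, len);
	rx->used += len;

	while (rx->used - off >= 2) {
		size_t mlen = msg_len(rx->buf + off);

		if (mlen < 2 || mlen > RX_BUF_LEN)
			return -1;
		if (mlen > rx->used - off)
			break; /* message not fully received yet */

		rx->processed++; /* a real handler would parse the message here */
		off += mlen;
	}

	/* Keep the leftover at the beginning for the next receive. */
	rx->used -= off;
	if (off && rx->used)
		memmove(rx->buf, rx->buf + off, rx->used);

	return 0;
}
```

Feeding one complete message followed by half of a second one leaves the
partial message at offset 0, ready for the next call to append to it.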

This will remove a lot of code of the current mechanism. Inside the
processing functionality we use a memmove() to ensure that the dlm
message is memory aligned. Having a dlm message always start at the
beginning of the buffer saves some memmove() calls because the src and
dest pointers are the same.

The cluster attribute "buffer_size" takes on a new meaning: it's now the
size of the application layer receive buffer. If this is changed during
runtime, the receive buffer will be reallocated. It's important that the
receive buffer is at least as large as the maximum possible dlm
message size, otherwise a received message cannot be placed inside
the receive buffer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 3f78cd7d 24-Sep-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix mark per nodeid setting

This patch fixes the per nodeid mark configuration so that it is set for
accepted sockets as well. Before this patch only the listen socket mark
value was used for all accepted connections. This patch ensures that the
cluster mark attribute value is always used for all sockets; if a
per nodeid mark value is specified, dlm will use this value for the
specific node.
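
The selection rule can be sketched as a small helper. This is a
hypothetical illustration: effective_mark() is a made-up name, and
treating a per-node mark of 0 as "not specified" is an assumption of
this sketch rather than something the commit message states.

```c
#include <stdint.h>

/*
 * Pick the mark for a socket: the per-node value when one is specified,
 * otherwise the cluster-wide value. Assumption: 0 means "not specified".
 */
static uint32_t effective_mark(uint32_t cluster_mark, uint32_t node_mark)
{
	return node_mark ? node_mark : cluster_mark;
}
```

The same helper would apply to both connecting and accepted sockets, so
neither path falls back to the listen socket's mark by accident.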

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0461e0db 24-Sep-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: remove lock dependency warning

During my experiments to make dlm robust against the tcpkill application
I sometimes ran into a circular lock dependency warning between
clusters_root.subsys.su_mutex and con->sock_mutex. We don't need to
hold the sock_mutex when getting the mark value, which takes the
clusters_root.subsys.su_mutex. This patch moves the specific handling
to just before the sock_mutex is taken.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 7ae0451e 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: use free_con to free connection

This patch uses the free_con() functionality to free the listen connection
if listen fails. It also fixes an issue where a freed resource is still
part of the connection_hash because hlist_del() is not called in this
case. The only difference is that free_con() handles othercon as well,
but this is never set for the listen connection.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 948c47e9 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: handle possible othercon writequeues

This patch adds freeing of possible other writequeue entries in the
othercon member of struct connection.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0de98432 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: move free writequeue into con free

This patch just moves the freeing of the struct connection member
writequeue into the functionality where struct connection is freed,
instead of doing two iterations.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 043697f0 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix dlm_local_addr memory leak

This patch fixes the following memory leak detected by kmemleak after
unmounting a gfs2 filesystem, which removed the last lockspace:

unreferenced object 0xffff9264f4f48f00 (size 128):
comm "mount", pid 425, jiffies 4294690253 (age 48.159s)
hex dump (first 32 bytes):
02 00 52 48 c0 a8 7a fb 00 00 00 00 00 00 00 00 ..RH..z.........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000067a34940>] kmemdup+0x18/0x40
[<00000000c935f9ab>] init_local+0x4c/0xa0
[<00000000bbd286ef>] dlm_lowcomms_start+0x28/0x160
[<00000000a86625cb>] dlm_new_lockspace+0x7e/0xb80
[<000000008df6cd63>] gdlm_mount+0x1cc/0x5de
[<00000000b67df8c7>] gfs2_lm_mount.constprop.0+0x1a3/0x1d3
[<000000006642ac5e>] gfs2_fill_super+0x717/0xba9
[<00000000d3ab7118>] get_tree_bdev+0x17f/0x280
[<000000001975926e>] gfs2_get_tree+0x21/0x90
[<00000000561ce1c4>] vfs_get_tree+0x28/0xc0
[<000000007fecaf63>] path_mount+0x434/0xc00
[<00000000636b9594>] __x64_sys_mount+0xe3/0x120
[<00000000cc478a33>] do_syscall_64+0x33/0x40
[<00000000ce9ccf01>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a47666eb 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: make connection hash lockless

There are some problems with the connections_lock. During my
experiments I sometimes saw circular dependencies with sock_lock.
The reason might be code paths which run nodeid2con() before
or after sock_lock is acquired.

Another issue is missing locks in the for_conn() iteration. Maybe this
works fine because for_conn() is running in a context where
connection_hash cannot be manipulated by others anymore.

However this patch changes the connection_hash to be protected by
sleepable rcu. The hotpath function __find_con() is implemented
lockless as it is only a reader of connection_hash and this hopefully
fixes the circular locking dependencies. The iteration for_conn() will
still call some sleepable functionality, that's why we use sleepable rcu
in this case.

This patch removes the kmemcache functionality as I think I need to
do some of the free() functionality via call_rcu(). However, allocation
time isn't an issue here. The dlm_allow_con will not be protected by a
lock anymore as I think it's enough to just set it and flush the
workqueues afterwards.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# aa7ab1e2 27-Aug-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: synchronize dlm before shutdown

This patch moves the dlm workqueue synchronization before the shutdown
handling. The patch just flushes all pending work before starting to
shut down the connection. At least for the send_workqueue we should flush
the workqueue to make sure there is no new connection handling going on,
as the dlm_allow_conn switch is turned to false before.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 055923bf 27-Jul-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: implement tcp graceful shutdown

During my code inspection I saw there is no implementation of a graceful
shutdown for tcp. This patch will introduce a graceful shutdown for tcp
connections. The shutdown is implemented synchronized as
dlm_lowcomms_stop() is called to end all dlm communication. After shutdown
is done, a lot of flush and closing functionality will be called. However
I don't see a problem with that.

The waitqueue for synchronizing the shutdown has a timeout of 10 seconds;
on timeout, a forced close will be executed.
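
A bounded wait of this kind can be sketched in user-space C with a
condition variable. This is only an analogy to the kernel waitqueue:
struct shutdown_wait and the three functions below are hypothetical
names, and the timeout is a parameter so the sketch is not tied to the
10 second value.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct shutdown_wait {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done; /* set once the peer has finished the shutdown */
};

static void shutdown_wait_init(struct shutdown_wait *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->done = false;
}

/* Called when the graceful shutdown has completed. */
static void shutdown_complete(struct shutdown_wait *w)
{
	pthread_mutex_lock(&w->lock);
	w->done = true;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

/*
 * Wait up to timeout_sec seconds for the shutdown to complete.
 * Returns 0 on completion, -ETIMEDOUT otherwise; on timeout the
 * caller would fall back to a forced close.
 */
static int shutdown_wait_timeout(struct shutdown_wait *w, int timeout_sec)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_sec;

	pthread_mutex_lock(&w->lock);
	while (!w->done && rc == 0)
		rc = pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
	done = w->done;
	pthread_mutex_unlock(&w->lock);

	return done ? 0 : -ETIMEDOUT;
}
```

A caller would typically invoke shutdown_wait_timeout(&w, 10) and close
the socket forcibly when it returns -ETIMEDOUT.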

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# ba3ab3ca 27-Jul-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: change handling of reconnects

This patch changes the handling of reconnects. At first we only close
the connection related to the communication failure. If we get a new
connection for an already existing connection we close the existing
connection and take the new one.

This patch significantly improves the stability of tcp connections while
running "tcpkill -9 -i $IFACE port 21064" while generating a lot of dlm
messages e.g. on a gfs2 mount with many files. My test setup shows that a
deadlock is "more" unlikely. Before this patch I always got a deadlock
after 5 seconds. After this patch my observation is
that it's more likely to survive 5 seconds and more, but a deadlock
still occurs after a certain time. My guess is that there are still
"segments" inside the tcp writequeue or retransmit queue which get dropped
when receiving a tcp reset [1]. This is hard to reproduce because the right
message needs to be inside these queues, which might even happen in the
first 5 seconds with this patch.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0ea47e4d 27-Jul-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: don't close socket on invalid message

This patch stops closing sockets when an invalid dlm message is
received. The connection would probably reconnect anyway. Not
closing the connection reduces the number of possible failures.
As we don't have a different strategy to react to such a scenario,
just keep the connection going and ignore the message.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9c9f168f 26-Jun-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: set skb mark per peer socket

This patch adds support to set the skb mark value for the DLM tcp and
sctp socket per peer. The mark value will be offered as per comm value
of configfs. At creation time of the peer socket it will be set as
socket option.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a5b7ab63 26-Jun-2020 Alexander Aring <aahringo@redhat.com>

fs: dlm: set skb mark for listen socket

This patch adds support to set the skb mark value for the DLM listen
tcp and sctp sockets. The mark value will be offered as cluster
configuration. At creation time of the listen socket it will be set as
socket option.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# c0425a42 29-May-2020 Christoph Hellwig <hch@lst.de>

net: add a new bind_add method

The SCTP protocol allows binding multiple addresses to a socket. That
feature is currently only exposed as a socket option. Add a bind_add
method to struct proto that allows binding additional addresses, and
switch the dlm code to use the method instead of going through the
socket option from kernel space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 40ef92c6 29-May-2020 Christoph Hellwig <hch@lst.de>

sctp: add sctp_sock_set_nodelay

Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 12abc5ee 27-May-2020 Christoph Hellwig <hch@lst.de>

tcp: add tcp_sock_set_nodelay

Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 26cfabf9 27-May-2020 Christoph Hellwig <hch@lst.de>

net: add sock_set_rcvbuf

Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce3d9544 27-May-2020 Christoph Hellwig <hch@lst.de>

net: add sock_set_keepalive

Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76ee0785 27-May-2020 Christoph Hellwig <hch@lst.de>

net: add sock_set_sndtimeo

Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel
space without going through a fake uaccess. The interface is
simplified to only pass the seconds value, as that is the only
thing needed at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b58f0e8f 27-May-2020 Christoph Hellwig <hch@lst.de>

net: add sock_set_reuseaddr

Add a helper to directly set the SO_REUSEADDR sockopt from kernel space
without going through a fake uaccess.

For this the iscsi target now has to formally depend on inet to avoid
a mostly theoretical compile failure. For actual operation it already
did depend on having ipv4 or ipv6 support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0774dc76 27-May-2020 Christoph Hellwig <hch@lst.de>

dlm: use the tcp version of accept_from_sock for sctp as well

Apart from a few fixes that are missing from the SCTP one, the only
difference is that TCP uses ->getpeername to get the remote address, while
SCTP uses kernel_getsockopt(.. SCTP_PRIMARY_ADDR). But given that
getpeername is defined to return the primary address for sctp, there
doesn't seem to be any reason for the different way of querying the
peername, or all the code duplication.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5311f707 25-Oct-2019 Arnd Bergmann <arnd@arndb.de>

dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD

Eliminate one more use of 'struct timeval' from the kernel so
we can eventually remove the definition as well.

The kernel supports the new format with a 64-bit time_t version
of timeval here, so use that instead of the old timeval.

Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# b355516f 02-Apr-2019 David Windsor <dwindsor@redhat.com>

dlm: check if workqueues are NULL before flushing/destroying

If the DLM lowcomms stack is shut down before any DLM
traffic can be generated, flush_workqueue() and
destroy_workqueue() can be called on empty send and/or recv
workqueues.

Insert guard conditionals to only call flush_workqueue()
and destroy_workqueue() on workqueues that are not NULL.

Signed-off-by: David Windsor <dwindsor@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 2522fe45 28-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 193

Based on 1 normalized pattern(s):

this copyrighted material is made available to anyone wishing to use
modify copy or redistribute it subject to the terms and conditions
of the gnu general public license v 2

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 45 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170027.342746075@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 45bdc661 02-Feb-2019 Deepa Dinamani <deepa.kernel@gmail.com>

socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes

SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: ccaulfie@redhat.com
Cc: deller@gmx.de
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa563d7b 19-Oct-2018 David Howells <dhowells@redhat.com>

iov_iter: Separate type from direction and use accessor functions

In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
than chains of bitwise-AND statements. This makes it easier to add further
iterator types. Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself. Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>


# da3627c3 28-May-2018 Gang He <ghe@suse.com>

dlm: remove O_NONBLOCK flag in sctp_connect_to_sock

We should remove the O_NONBLOCK flag when calling sock->ops->connect()
in the sctp_connect_to_sock() function.
Why?
1. Up to now, the sctp socket connect() function ignores the flag argument,
which means the O_NONBLOCK flag does not take effect, so we should remove
it to avoid confusion (but this is not urgent).
2. In the future, there will be a patch to fix this problem, and then the
flag argument will take effect. The patch has been queued at
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/sctp?id=644fbdeacf1d3edd366e44b8ba214de9d1dd66a9.
But the O_NONBLOCK flag will make sock->ops->connect() return directly
without any wait time, so the connection will not be established and the DLM
kernel module will call sock->ops->connect() again and again. The bad results
are that CPU usage is almost 100%, and a soft_lockup can even be triggered if
the related configurations are enabled.
The DLM kernel module also prints lots of messages like,
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
The upper application (e.g. ocfs2 mount command) hangs in new_lockspace(),
the whole backtrace is as below,
tb0307-nd2:~ # cat /proc/2935/stack
[<0>] new_lockspace+0x957/0xac0 [dlm]
[<0>] dlm_new_lockspace+0xae/0x140 [dlm]
[<0>] user_cluster_connect+0xc3/0x3a0 [ocfs2_stack_user]
[<0>] ocfs2_cluster_connect+0x144/0x220 [ocfs2_stackglue]
[<0>] ocfs2_dlm_init+0x215/0x440 [ocfs2]
[<0>] ocfs2_fill_super+0xcb0/0x1290 [ocfs2]
[<0>] mount_bdev+0x173/0x1b0
[<0>] mount_fs+0x35/0x150
[<0>] vfs_kern_mount.part.23+0x54/0x100
[<0>] do_mount+0x59a/0xc40
[<0>] SyS_mount+0x80/0xd0
[<0>] do_syscall_64+0x76/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[<0>] 0xffffffffffffffff

So, I think we should remove the O_NONBLOCK flag here, since the DLM kernel
module cannot handle a non-blocking socket in connect() properly.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# f706d830 02-May-2018 Gang He <ghe@suse.com>

dlm: make sctp_connect_to_sock() return in specified time

When the user sets up a two-ring cluster, the DLM kernel module
automatically selects the SCTP protocol to communicate
between nodes. There will be a hang of about 5 minutes in the DLM
kernel module if one ring is broken before switching to
the other ring, which will potentially affect the dependent upper
applications, e.g. ocfs2, gfs2, clvm and clustered-MD, etc.
Unfortunately, if the user sets up a two-ring cluster, we cannot
explicitly select TCP as the DLM communication protocol, since the
DLM kernel module only supports the SCTP protocol for multiple
ring clusters.
Based on my investigation, the time is spent in the sock->ops->connect()
function before it returns the ETIMEDOUT(-110) error. Since the O_NONBLOCK
argument to the connect() function does not work here, we should
make the sock->ops->connect() function return within a specified time by
setting the socket SO_SNDTIMEO attribute.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b09c603c 01-May-2018 Gang He <ghe@suse.com>

dlm: fix a clerical error when set SCTP_NODELAY

There is a clerical error when turning off Nagle's algorithm in the
sctp_connect_to_sock() function, which causes turning off
Nagle's algorithm to fail.
After this correction, DLM performance is noticeably improved
when using the SCTP protocol.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9b2c45d4 12-Feb-2018 Denys Vlasenko <dvlasenk@redhat.com>

net: make getname() functions return length rather than use int* parameter

Changes since v1:
Added changes in these files:
drivers/infiniband/hw/usnic/usnic_transport.c
drivers/staging/lustre/lnet/lnet/lib-socket.c
drivers/target/iscsi/iscsi_target_login.c
drivers/vhost/net.c
fs/dlm/lowcomms.c
fs/ocfs2/cluster/tcp.c
security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
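
The changed calling convention can be sketched in user-space C.
demo_getname() and caller() are hypothetical stand-ins invented for this
example; they only mimic the "return the length on success, a negative
errno on failure" contract described above and are not kernel functions.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical getname-style helper following the new convention:
 * returns the sockaddr length on success, a negative errno on failure.
 */
static int demo_getname(int connected, struct sockaddr_storage *addr)
{
	struct sockaddr_in *in = (struct sockaddr_in *)addr;

	if (!connected)
		return -ENOTCONN;

	memset(addr, 0, sizeof(*addr));
	in->sin_family = AF_INET;
	return (int)sizeof(*in);
}

/* A caller that only cares about success or failure. */
static int caller(int connected)
{
	struct sockaddr_storage addr;
	int err;

	err = demo_getname(connected, &addr);
	if (err < 0) /* not "if (err)": success is now a positive length */
		return err;

	return 0;
}
```

With the old "if (err)" test, every successful call would have been
treated as a failure, which is why the callers had to be adjusted.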

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data    bss      dec     hex filename
30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
30108109 2633612 873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>


# c8c7840e 20-Sep-2017 Al Viro <viro@zeniv.linux.org.uk>

dlm: switch to sock_recvmsg()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 26b41099 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix NULL pointer dereference in send_to_sock()

The writequeue and writequeue_lock members of othercon were not initialized.
If lowcomms_state_change() is called from the network layer, othercon->swork
may be scheduled. In this case, send_to_sock() will generate a NULL pointer
dereference. We avoid this problem by correctly initializing the writequeue
and writequeue_lock members of othercon.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0aa18464 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix to reschedule rwork

When an error occurs in kernel_recvmsg or kernel_sendpage,
close_connection is called, and if receive work is already scheduled,
the receive work is canceled. In that case, the receive work will never
be scheduled again after reconnection, because the CF_READ_PENDING
flag remains set.
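
The effect of a stale pending flag can be sketched in user-space C. The
names struct conn, schedule_read() and close_conn() are hypothetical and
the work queue is reduced to a counter; the sketch only assumes that the
fix amounts to clearing the flag when the pending work is canceled.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct conn {
	atomic_bool read_pending; /* stands in for CF_READ_PENDING */
	int queued;               /* how often receive work was queued */
};

/* Queue receive work unless it is already pending. */
static bool schedule_read(struct conn *c)
{
	if (atomic_exchange(&c->read_pending, true))
		return false; /* already pending, nothing to do */

	c->queued++;
	return true;
}

/*
 * Close the connection. Pending receive work is canceled here, so the
 * flag must be cleared too; otherwise schedule_read() would refuse to
 * queue work again after a reconnect.
 */
static void close_conn(struct conn *c)
{
	atomic_store(&c->read_pending, false);
}
```

Without the atomic_store() in close_conn(), the third schedule_read()
call in a schedule, schedule, close, schedule sequence would return
false and the connection would stop receiving.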

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 93eaadeb 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix to use sk_callback_lock correctly

In the current implementation, there was not enough exclusion control
between the code that sets the callback functions for the connection
structure and the callback functions that refer to the connection
structure. Fix this.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 3421fb15 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix memory leak in tcp_accept_from_sock()

The sk member of the socket generated by sock_create_kern() is overwritten
by ops->accept(). So the previous sk will not be released.
We use kernel_accept() instead of sock_create_kern() and ops->accept().

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 173a31fe 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: use CF_CLOSE flag to stop dlm_send correctly

If reconnection fails while executing dlm_lowcomms_stop,
dlm_send will not stop.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 8a4abb08 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: Reanimate CF_WRITE_PENDING flag

CF_WRITE_PENDING flag has been reanimated to make dlm_send stop properly
when running dlm_lowcomms_stop.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# c553e173 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: close othercon at send/receive error

If an error occurs while sending or receiving and othercon exists,
sending or receiving through othercon may also result in an error.
Close othercon in advance as well.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# c7355827 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix to use sock_mutex correctly in xxx_accept_from_sock

In the current implementation, exclusion control for othercon in
tcp_accept_from_sock() and sctp_accept_from_sock() was not
sufficient. Fix it.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# b2a66629 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix race condition between dlm_send and dlm_recv

When kernel_sendpage (in send_to_sock) and kernel_recvmsg
(in receive_from_sock) return an error, close_connection may run in both
paths at the same time. In that case, they may wait for each other in
cancel_work_sync.

Signed-off-by: Tadashi Miyauchi <miayuchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# f0fb83cb 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix double list_del()

dlm_lowcomms_stop() was not functioning properly. It has to wait until
all processing on send_workqueue and recv_workqueue has finished.
This problem causes the following issue. The scenario is:

1. dlm_send thread:
send_to_sock refers con->writequeue
2. main thread:
dlm_lowcomms_stop calls list_del
3. dlm_send thread:
send_to_sock calls list_del in writequeue_entry_complete

[ 1925.770305] dlm: canceled swork for node 4
[ 1925.772374] general protection fault: 0000 [#1] SMP
[ 1925.777930] Modules linked in: ocfs2_stack_user ocfs2 ocfs2_nodemanager ocfs2_stackglue dlm fmxnet(O) fmx_api(O) fmx_cu(O) igb(O) kvm_intel kvm irqbypass autofs4
[ 1925.794131] CPU: 3 PID: 6994 Comm: kworker/u8:0 Tainted: G O 4.4.39 #1
[ 1925.802684] Hardware name: TOSHIBA OX/OX, BIOS OX-P0015 12/03/2015
[ 1925.809595] Workqueue: dlm_send process_send_sockets [dlm]
[ 1925.815714] task: ffff8804398d3c00 ti: ffff88046910c000 task.ti: ffff88046910c000
[ 1925.824072] RIP: 0010:[<ffffffffa04bd158>] [<ffffffffa04bd158>] process_send_sockets+0xf8/0x280 [dlm]
[ 1925.834480] RSP: 0018:ffff88046910fde0 EFLAGS: 00010246
[ 1925.840411] RAX: dead000000000200 RBX: 0000000000000001 RCX: 000000000000000a
[ 1925.848372] RDX: ffff88046bd980c0 RSI: 0000000000000000 RDI: ffff8804673c5670
[ 1925.856341] RBP: ffff88046910fe20 R08: 00000000000000c9 R09: 0000000000000010
[ 1925.864311] R10: ffffffff81e22fc0 R11: 0000000000000000 R12: ffff8804673c56d8
[ 1925.872281] R13: ffff8804673c5660 R14: ffff88046bd98440 R15: 0000000000000058
[ 1925.880251] FS: 0000000000000000(0000) GS:ffff88047fd80000(0000) knlGS:0000000000000000
[ 1925.889280] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1925.895694] CR2: 00007fff09eadf58 CR3: 00000004690f5000 CR4: 00000000001006e0
[ 1925.903663] Stack:
[ 1925.905903] ffff8804673c5630 ffff8804673c5620 ffff8804673c5670 ffff88007d219b40
[ 1925.914181] ffff88046f095800 0000000000000100 ffff8800717a1400 ffff8804673c56d8
[ 1925.922459] ffff88046910fe60 ffffffff81073db2 00ff880400000000 ffff88007d219b40
[ 1925.930736] Call Trace:
[ 1925.933468] [<ffffffff81073db2>] process_one_work+0x162/0x450
[ 1925.939983] [<ffffffff81074459>] worker_thread+0x69/0x4a0
[ 1925.946109] [<ffffffff810743f0>] ? rescuer_thread+0x350/0x350
[ 1925.952622] [<ffffffff8107956f>] kthread+0xef/0x110
[ 1925.958165] [<ffffffff81079480>] ? kthread_park+0x60/0x60
[ 1925.964283] [<ffffffff8186ab2f>] ret_from_fork+0x3f/0x70
[ 1925.970312] [<ffffffff81079480>] ? kthread_park+0x60/0x60
[ 1925.976436] Code: 01 00 00 48 8b 7d d0 e8 07 d3 3a e1 45 01 7e 18 45 29 7e 1c 75 ab 41 8b 46 24 85 c0 75 a3 49 8b 16 49 8b 46 08 31 f6 48 89 42 08 <48> 89 10 48 b8 00 01 00 00 00 00 ad de 49 8b 7e 10 49 89 06 66
[ 1925.997791] RIP [<ffffffffa04bd158>] process_send_sockets+0xf8/0x280 [dlm]
[ 1926.005577] RSP <ffff88046910fde0>
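
The RAX value in the trace (dead000000000200) appears to match the poison that list_del() writes into the prev pointer of a removed entry on this kernel, which is what a second list_del() on the same entry would then dereference. The sketch below models that in userspace. It is an assumption-laden illustration: struct list_node, list_del_checked() and the two poison objects are hypothetical stand-ins, and the check turns the second delete into a reported failure instead of a fault.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

/* Stand-ins for the kernel's poison pointers; never linked into a list. */
struct list_node poison_next, poison_prev;

void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

void list_add_tail(struct list_node *entry, struct list_node *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Unlink an entry and poison its pointers. A second call on the same
 * entry is detected and rejected; without the check it would write
 * through the poison pointers, as in the oops above.
 */
bool list_del_checked(struct list_node *entry)
{
	if (entry->next == &poison_next || entry->prev == &poison_prev)
		return false;	/* already removed by someone else */
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = &poison_next;
	entry->prev = &poison_prev;
	return true;
}
```

Steps 2 and 3 of the scenario correspond to two list_del_checked() calls on the same entry: the first succeeds and the second returns false.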

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 988419a9 12-Sep-2017 tsutomu.owa@toshiba.co.jp <tsutomu.owa@toshiba.co.jp>

DLM: fix remove save_cb argument from add_sock()

The save_cb argument is not used, so remove it.

Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# cc661fc9 12-Sep-2017 Bob Peterson <rpeterso@redhat.com>

DLM: Fix saving of NULL callbacks

In a previous patch I noted that accept() often copies the struct
sock (sk) which overwrites the sock callbacks. However, in testing
we discovered that the dlm connection structures (con) are sometimes
deleted and recreated as connections come and go, and since they're
zeroed out by kmem_cache_zalloc, the saved callback pointers are
also initialized to zero. But with today's DLM code, the callbacks
are only saved when a socket is added.

During recovery testing, we discovered a common situation in which
the new con is initialized to zero, then a socket is added after
accept(). In this case, the sock's saved values are all NULL, but
the saved values are wiped out, due to accept(). Therefore, we
don't have a known good copy of the callbacks from which we can
restore.

Since the struct sock callbacks are always good after listen(),
this patch saves the known good values after listen(). These good
values are then used for subsequent restores.
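
The idea of keeping one known-good copy taken from the listening socket can be sketched in userspace as follows. All names here (struct sock_model, listen_saved, the model_* and default_* functions) are hypothetical, and only two callbacks are modelled; this is not the actual lowcomms code.

```c
#include <stddef.h>

struct sock_model {
	void (*data_ready)(struct sock_model *sk);
	void (*error_report)(struct sock_model *sk);
};

struct saved_callbacks {
	void (*data_ready)(struct sock_model *sk);
	void (*error_report)(struct sock_model *sk);
};

/* Known-good copy, taken once from the listening socket. */
struct saved_callbacks listen_saved;

void default_data_ready(struct sock_model *sk) { (void)sk; }
void default_error_report(struct sock_model *sk) { (void)sk; }
void model_data_ready(struct sock_model *sk) { (void)sk; }
void model_error_report(struct sock_model *sk) { (void)sk; }

/* Record the callbacks while they are still the protocol defaults. */
void save_listen_callbacks(const struct sock_model *sk)
{
	listen_saved.data_ready = sk->data_ready;
	listen_saved.error_report = sk->error_report;
}

/* Override the callbacks, as is done when a socket is added. */
void install_model_callbacks(struct sock_model *sk)
{
	sk->data_ready = model_data_ready;
	sk->error_report = model_error_report;
}

/* Restore from the listen-time copy, not from per-connection state. */
void restore_callbacks(struct sock_model *sk)
{
	sk->data_ready = listen_saved.data_ready;
	sk->error_report = listen_saved.error_report;
}
```

Because the restore reads only listen_saved, a socket whose callbacks were copied in their overridden state by accept() can still be returned to the defaults, even when its connection structure was freshly zeroed.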

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 01da24d3 12-Sep-2017 Bob Peterson <rpeterso@redhat.com>

DLM: Eliminate CF_WRITE_PENDING flag

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 61d9102b 12-Sep-2017 Bob Peterson <rpeterso@redhat.com>

DLM: Eliminate CF_CONNECT_PENDING flag

Before this patch, there was a flag in the con structure that was
used to determine whether or not a connect was needed. The bit was
set here and there, and cleared here and there, so it left some
race conditions: the bit was set, work was queued, then the worker
cleared the bit, allowing someone else to set it while the worker
ran. For the most part, this worked okay, but we got into trouble
if connections were lost and it needed to reconnect.

This patch eliminates the flag in favor of simply checking if we
actually have a sock pointer while protected by the mutex.
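
A minimal userspace sketch of that approach, assuming a pthread mutex in place of the kernel mutex: struct conn_model, connect_if_needed() and fake_connect() are hypothetical names introduced only for illustration.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a connection; sock is NULL while disconnected. */
struct conn_model {
	pthread_mutex_t sock_mutex;
	void *sock;
};

/* Stand-in for the real connect path; returns a non-NULL dummy socket. */
void *fake_connect(void)
{
	static int dummy_socket;
	return &dummy_socket;
}

/*
 * Decide whether to connect by looking at the sock pointer itself while
 * holding the mutex, so there is no separate flag that can go stale.
 * Returns true only if this call established the connection.
 */
bool connect_if_needed(struct conn_model *con, void *(*do_connect)(void))
{
	bool connected = false;

	pthread_mutex_lock(&con->sock_mutex);
	if (con->sock == NULL) {
		con->sock = do_connect();
		connected = (con->sock != NULL);
	}
	pthread_mutex_unlock(&con->sock_mutex);
	return connected;
}
```

A second caller that arrives after the connection exists sees a non-NULL sock and does nothing, which is the property the flag could not guarantee once the worker had cleared it.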

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1c242853 07-Aug-2017 Guoqing Jiang <gqjiang@suse.com>

dlm: use sock_create_lite inside tcp_accept_from_sock

With commit 0ffdaf5b41cf ("net/sock: add WARN_ON(parent->sk)
in sock_graft()"), a calltrace happened as follows:

[ 457.018340] WARNING: CPU: 0 PID: 15623 at ./include/net/sock.h:1703 inet_accept+0x135/0x140
...
[ 457.018381] RIP: 0010:inet_accept+0x135/0x140
[ 457.018381] RSP: 0018:ffffc90001727d18 EFLAGS: 00010286
[ 457.018383] RAX: 0000000000000001 RBX: ffff880012413000 RCX: 0000000000000001
[ 457.018384] RDX: 000000000000018a RSI: 00000000fffffe01 RDI: ffffffff8156fae8
[ 457.018384] RBP: ffffc90001727d38 R08: 0000000000000000 R09: 0000000000004305
[ 457.018385] R10: 0000000000000001 R11: 0000000000004304 R12: ffff880035ae7a00
[ 457.018386] R13: ffff88001282af10 R14: ffff880034e4e200 R15: 0000000000000000
[ 457.018387] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 457.018388] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 457.018389] CR2: 00007fdec22f9000 CR3: 0000000002b5a000 CR4: 00000000000006f0
[ 457.018395] Call Trace:
[ 457.018402] tcp_accept_from_sock.part.8+0x12d/0x449 [dlm]
[ 457.018405] ? vprintk_emit+0x248/0x2d0
[ 457.018409] tcp_accept_from_sock+0x3f/0x50 [dlm]
[ 457.018413] process_recv_sockets+0x3b/0x50 [dlm]
[ 457.018415] process_one_work+0x138/0x370
[ 457.018417] worker_thread+0x4d/0x3b0
[ 457.018419] kthread+0x109/0x140
[ 457.018421] ? rescuer_thread+0x320/0x320
[ 457.018422] ? kthread_park+0x60/0x60
[ 457.018424] ret_from_fork+0x25/0x30

Since newsocket created by sock_create_kern sets its
sock by the path:

sock_create_kern -> __sock_create
->pf->create => inet_create
-> sock_init_data

Then WARN_ON is triggered by "con->sock->ops->accept =>
inet_accept -> sock_graft", it also means newsock->sk
is leaked since sock_graft will replace it with a new
sk.

To resolve the issue, we need to use sock_create_lite
instead of sock_create_kern, like commit 0933a578cd55
("rds: tcp: use sock_create_lite() to create the accept
socket") did.

Reported-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# cdfbabfb 09-Mar-2017 David Howells <dhowells@redhat.com>

net: Work around lockdep limitation in sockets that use sockets

Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

(1) If the pagefault handler decides it needs to read pages from AFS, it
calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
creating a call requires the socket lock:

mmap_sem must be taken before sk_lock-AF_RXRPC

(2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
binds the underlying UDP socket whilst holding its socket lock.
inet_bind() takes its own socket lock:

sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

(3) Reading from a TCP socket into a userspace buffer might cause a fault
and thus cause the kernel to take the mmap_sem, but the TCP socket is
locked whilst doing this:

sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks. The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace. This is
a limitation in the design of lockdep.

Fix the general case by:

(1) Double up all the locking keys used in sockets so that one set are
used if the socket is created by userspace and the other set is used
if the socket is created by the kernel.

(2) Store the kern parameter passed to sk_alloc() in a variable in the
sock struct (sk_kern_sock). This informs sock_lock_init(),
sock_init_data() and sk_clone_lock() as to the lock keys to be used.

Note that the child created by sk_clone_lock() inherits the parent's
kern setting.

(3) Add a 'kern' parameter to ->accept() that is analogous to the one
passed in to ->create() that distinguishes whether kernel_accept() or
sys_accept4() was the caller and can be passed to sk_alloc().

Note that a lot of accept functions merely dequeue an already
allocated socket. I haven't touched these as the new socket already
exists before we get the parameter.

Note also that there are a couple of places where I've made the accepted
socket unconditionally kernel-based:

irda_accept()
rds_tcp_accept_one()
tcp_accept_from_sock()

because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal. I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 26c1ec2f 22-Oct-2016 Wei Yongjun <weiyongjun1@huawei.com>

dlm: fix error return code in sctp_accept_from_sock()

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# d2fee58a 10-Oct-2016 Bob Peterson <rpeterso@redhat.com>

dlm: remove lock_sock to avoid scheduling while atomic

Before this patch, functions save_callbacks and restore_callbacks
called function lock_sock and release_sock to prevent other processes
from messing with the struct sock while the callbacks were saved and
restored. However, function add_sock calls write_lock_bh prior to
calling save_callbacks, which disables preemption. So the call to
lock_sock would try to schedule when we can't schedule.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 3735b4b9 23-Sep-2016 Bob Peterson <rpeterso@redhat.com>

dlm: don't save callbacks after accept

When DLM calls accept() on a socket, the comm code copies the sk
after we've saved its callbacks. Afterward, it calls add_sock which
saves the callbacks a second time. Since the error reporting function
lowcomms_error_report calls the previous callback too, this results
in a recursive call to itself. This patch adds a new parameter to
function add_sock to tell whether to save the callbacks. Function
tcp_accept_from_sock (and its sctp counterpart) then calls it with
false to avoid the recursion.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 3a8db798 08-Oct-2016 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: free workqueues after the connections

After backporting commit ee44b4bc054a ("dlm: use sctp 1-to-1 API")
series to a kernel with an older workqueue which didn't use RCU yet, it
was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
too early as free_conn() will try to access that memory for canceling
the queued works if any.

This issue was introduced by commit 0d737a8cfd83 as before it such
attempt to cancel the queued works wasn't performed, so the issue was
not present.

This patch fixes it by simply inverting the free order.

Cc: stable@vger.kernel.org
Fixes: 0d737a8cfd83 ("dlm: fix race while closing connections")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5c93f56f 22-Jun-2016 Amitoj Kaur Chawla <amitoj1606@gmail.com>

dlm: Use kmemdup instead of kmalloc and memcpy

Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@

- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
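
For readers without kernel context, the transformation has the following shape in userspace C. This is a sketch only: memdup_sketch() is a hypothetical helper built on malloc(), whereas the kernel's kmemdup() additionally takes allocation flags.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Allocate len bytes and copy src into them in one step. Returns NULL
 * if the allocation fails, so callers need a single error check.
 */
void *memdup_sketch(const void *src, size_t len)
{
	void *dst = malloc(len);

	if (dst != NULL)
		memcpy(dst, src, len);
	return dst;
}
```

A caller then writes one call followed by one NULL check, in place of an allocation, a check and a separate memcpy().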

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely that it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();
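
The first two rules are safe because both shift counts are zero once the two constants are equal. A small sketch, assuming 4 KiB pages and using hypothetical MODEL_* macro names so as not to clash with real headers:

```c
/* Assumed values for illustration: 4 KiB pages, identical shifts. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_CACHE_SHIFT	MODEL_PAGE_SHIFT

/* Old style: convert a page-cache index to a page index. */
unsigned long index_old_style(unsigned long index)
{
	return index << (MODEL_PAGE_CACHE_SHIFT - MODEL_PAGE_SHIFT);
}

/* New style: the shift by zero is dropped entirely. */
unsigned long index_new_style(unsigned long index)
{
	return index;
}
```

Both functions return the same value for every input, so the rewrite changes no behaviour.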

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b81171cb 05-Feb-2016 Bob Peterson <rpeterso@redhat.com>

DLM: Save and restore socket callbacks properly

This patch fixes the problems with patch b3a5bbfd7.

1. It removes a return statement from lowcomms_error_report
because it needs to call the original error report in all paths
through the function.
2. All socket callbacks are saved and restored, not just the
sk_error_report, and that's done so with proper locking like
sunrpc does.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1a31833d 17-Jan-2016 Bob Peterson <rpeterso@redhat.com>

DLM: Replace nodeid_to_addr with kernel_getpeername

This patch replaces the call to nodeid_to_addr with a call to
kernel_getpeername. This avoids taking a spinlock because it may
potentially be called from a softirq context.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9cd3e072 29-Nov-2015 Eric Dumazet <edumazet@google.com>

net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA

This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int nr, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b3a5bbfd 27-Aug-2015 Bob Peterson <rpeterso@redhat.com>

dlm: print error from kernel_sendpage

Print a dlm-specific error when a socket error occurs
when sending a dlm message.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 18df8a87 17-Aug-2015 kbuild test robot <fengguang.wu@intel.com>

dlm: sctp_accept_from_sock() can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 00dcffae 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: fix reconnecting but not sending data

There are cases in which lowcomms_connect_sock() is called directly,
which caused the CF_WRITE_PENDING flag to not be set upon reconnect,
especially in send_to_sock() error handling. In the latter case, the flag
was already cleared and no further attempt at transmitting would be made.

As dlm tends to connect when it needs to transmit something, it makes
sense to always mark this flag right after the connect.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# acee4e52 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: replace BUG_ON with a less severe handling

BUG_ON() is a severe action for this case, especially now that DLM with
SCTP will use 1 socket per association. Instead, we can just close the
socket on this error condition and return from the function.

Also move the check to an earlier stage as it won't change and thus we
can abort as soon as possible.

Although this issue was reported when still using SCTP with 1-to-many
API, this cleanup wouldn't be that simple back then because we couldn't
close the socket and making sure such event would cease would be hard.
And actually, previous code was closing the association, yet SCTP layer
is still raising the new data event. Probably a bug to be fixed in SCTP.

Reported-by: <tan.hu@zte.com.cn>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# ee44b4bc 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: use sctp 1-to-1 API

DLM is using the 1-to-many API but in a 1-to-1 fashion. That is, it's not
needed, but it causes DLM to use sctp_do_peeloff() to mimic a
kernel_accept(), and this causes a symbol dependency on the sctp module.

By switching it to 1-to-1 API we can avoid this dependency and also
reduce quite a lot of SCTP-specific code in lowcomms.c.

The caveat is that now DLM won't always use the same src port. It will
choose a random one, just like TCP code. This allows the peers to
attempt simultaneous connections, which now are handled just like for
TCP.

Even more sharing between TCP and SCTP code on DLM is possible, but it
is intentionally left for a later commit.

Note that for using nodes with this commit, you have to have at least
the early fixes on this patchset otherwise it will trigger some issues
on old nodes.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 356344c4 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: fix not reconnecting on connecting error handling

If we don't clear that bit, lowcomms_connect_sock() will not schedule
another attempt, and no further attempt will be done.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0d737a8c 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: fix race while closing connections

When a connection has issues, DLM may need to close it. Therefore we
should also cancel pending work for such a connection at that time,
and not just when dlm is not willing to use this connection anymore.

Also, if we don't clear CF_CONNECT_PENDING flag, the error handling
routines won't be able to re-connect as lowcomms_connect_sock() will
check for it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 28926a09 11-Aug-2015 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

dlm: fix connection stealing if using SCTP

When using SCTP and accepting a new connection, DLM currently validates
if the peer trying to connect to it is one of the cluster nodes, but it
doesn't check if it already has a connection to it or not.

If it already had a connection, it will be overwritten, and the new one
will be used for writes, possibly causing the node to leave the cluster
due to communication breakage.

Still, one could DoS the node by attempting N connections and keeping
them open.

As said, but being explicit, both situations are only triggerable from
other cluster nodes, but are doable with only user-level perms.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# eeb1bd5c 08-May-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Add a struct net parameter to sock_create_kern

This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 883854c5 12-Jun-2014 Lidong Zhong <lzhong@suse.com>

dlm: keep listening connection alive with sctp mode

The connection struct with nodeid 0 is the listening socket,
not a connection to another node. The sctp resend function
was not checking that the nodeid was valid (non-zero), so it
would mistakenly get and resend on the listening connection
when nodeid was zero.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 676d2369 11-Apr-2014 David S. Miller <davem@davemloft.net>

net: Fix use after free by removing length arg from sk_data_ready callbacks.

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->sk_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about its
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 048ed4b6 21-Jan-2014 wangweidong <wangweidong1@huawei.com>

sctp: remove macros sctp_{lock|release}_sock

{lock|release}_sock were redefined as sctp_{lock|release}_sock for
user-space-friendly code, which we haven't used in years, so remove them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ece35848 10-Dec-2013 Dongmao Zhang <dmzhang@suse.com>

dlm: set zero linger time on sctp socket

The recovery time for a failed node was taking a long
time because the failed node could not perform the full
shutdown process. Removing the linger time speeds this
up. The dlm does not care what happens to messages to
or from the failed node.

Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 06452eb0 18-Jun-2013 Wei Yongjun <yongjun_wei@trendmicro.com.cn>

dlm: remove duplicated include from lowcomms.c

Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>


# 86e92ad2 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: disable nagle for SCTP

For TCP we disable Nagle and I cannot think of why it would be needed
for SCTP. When disabled it seems to improve dlm_lock operations like it
does for TCP.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5d689871 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: retry failed SCTP sends

Currently if a SCTP send fails, we lose the data we were trying
to send because the writequeue_entry is released when we do the send.
When this happens other nodes will then hang waiting for a reply.

This adds support for SCTP to retry the send operation.

I also removed the retry limit for SCTP use, because we want
to make sure we try every path during init time and for longer
failures we want to continually retry in case paths come back up
while trying other paths. We will do this until userspace tells us
to stop.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# 98e1b60e 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: try other IPs when sctp init assoc fails

Currently, if we cannot create an association to the first IP addr
that is added to DLM, the SCTP init assoc code will just retry
the same IP. This patch adds a simple failover scheme where we
will try one of the addresses that was passed into DLM.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# b390ca38 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: clear correct bit during sctp init failure handling

We should be testing and clearing the init pending bit because later,
when sctp_init_assoc is called again, it will check that the bit is not
set and then set it.

We do not want to touch CF_CONNECT_PENDING here because we will queue
swork and process_send_sockets will then call the connect_action function.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# e1631d0c 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: set sctp assoc id during setup

sctp_assoc was not getting set so later lookups failed.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# efad7e6b 14-Jun-2013 Mike Christie <michaelc@cs.wisc.edu>

dlm: clear correct init bit during sctp setup

We were clearing the base con's init pending flags, but the
con for the node was the one with the pending bit set.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1b866434 08-Apr-2013 Daniel Borkmann <daniel@iogearbox.net>

net: sctp: introduce uapi header for sctp

This patch introduces a UAPI header for the SCTP protocol,
so that we can facilitate the maintenance and development of
user land applications or libraries, in particular in terms
of header synchronization.

To not break compatibility, some fragments from lksctp-tools'
netinet/sctp.h have been carefully included, while taking care
that neither kernel nor user land breaks, so both compile fine
with this change (for lksctp-tools I tested with the old
netinet/sctp.h header and with a newly adapted one that includes
the uapi sctp header). lksctp-tools smoke test run through
successfully as well in both cases.

Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b67bfe0d 27-Feb-2013 Sasha Levin <sasha.levin@oracle.com>

hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; these
were modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
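
The three-argument iterator form described in b67bfe0d can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct hnode`, `struct hhead`, `entry_or_null`, `sketch_hlist_for_each_entry`, `struct item` and `sum_items` are hypothetical names, not the definitions in linux/list.h, and the macro relies on the GCC/Clang `__typeof__` extension.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked stand-ins for the kernel's hlist types. */
struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

/* Map a node pointer back to its containing object, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
	((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three arguments only: the cursor itself is the typed entry pointer,
 * so no separate node cursor is needed. */
#define sketch_hlist_for_each_entry(pos, head, member)                       \
	for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);  \
	     pos;                                                             \
	     pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
	int value;
	struct hnode link;
};

/* Sum the values of every item on the list. */
static int sum_items(struct hhead *head)
{
	struct item *it;
	int sum = 0;

	assert(head != NULL);
	sketch_hlist_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

The loop body reads exactly like a list_for_each_entry() loop, which is the symmetry the commit message asks for.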


# eeee2b5f 18-Oct-2012 Wei Yongjun <yongjun_wei@trendmicro.com.cn>

dlm: remove unused variable in *dlm_lowcomms_get_buffer()

The variable users is initialized but never used
otherwise, so remove the unused variable.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>


# 9c5bef58 13-Aug-2012 Ying Xue <ying.xue@windriver.com>

dlm: cleanup send_to_sock routine

Remove unnecessary code from the send_to_sock routine.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 4dd40f0c 10-Aug-2012 Ying Xue <ying.xue@windriver.com>

dlm: convert add_sock routine return value type to void

Since add_sock() always returns a success code - 0, its return
value type should be changed from integer to void.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b4c798cf 09-Aug-2012 Xue Ying <ying.xue@windriver.com>

dlm: remove redundant variable assignments

Once tcp_create_listen_sock() has returned successfully, we
will invoke add_sock() immediately. In add_sock(), the 'con'
variable is assigned to 'sk_user_data', meanwhile, the 'sock' is
also set to 'con->sock'. So it's unnecessary to do the same thing
in tcp_create_listen_sock().

Signed-off-by: Xue Ying <ying.xue@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 36b71a8b 25-Jul-2012 David Teigland <teigland@redhat.com>

dlm: fix deadlock between dlm_send and dlm_controld

A deadlock sometimes occurs between dlm_controld closing
a lowcomms connection through configfs and dlm_send looking
up the address for a new connection in configfs.

dlm_controld does a configfs rmdir which calls
dlm_lowcomms_close which waits for dlm_send to
cancel work on the workqueues.

The dlm_send workqueue thread has called
tcp_connect_to_sock which calls dlm_nodeid_to_addr
which does a configfs lookup and blocks on a lock
held by dlm_controld in the rmdir path.

The solution here is to save the node addresses within
the lowcomms code so that the lowcomms workqueue does
not need to step through configfs to get a node address.

dlm_controld:
wait_for_completion+0x1d/0x20
__cancel_work_timer+0x1b3/0x1e0
cancel_work_sync+0x10/0x20
dlm_lowcomms_close+0x4c/0xb0 [dlm]
drop_comm+0x22/0x60 [dlm]
client_drop_item+0x26/0x50 [configfs]
configfs_rmdir+0x180/0x230 [configfs]
vfs_rmdir+0xbd/0xf0
do_rmdir+0x103/0x120
sys_rmdir+0x16/0x20

dlm_send:
mutex_lock+0x2b/0x50
get_comm+0x34/0x140 [dlm]
dlm_nodeid_to_addr+0x18/0xd0 [dlm]
tcp_connect_to_sock+0xf4/0x2d0 [dlm]
process_send_sockets+0x1d2/0x260 [dlm]
worker_thread+0x170/0x2a0

Signed-off-by: David Teigland <teigland@redhat.com>


# 513ef596 30-Mar-2012 David Teigland <teigland@redhat.com>

dlm: prevent connections during shutdown

During lowcomms shutdown, a new connection could possibly
be created, and attempt to use a workqueue that's been
destroyed. Similarly, during startup, a new connection
could attempt to use a workqueue that's not been set up
yet. Add a global variable to indicate when new connections
are allowed.

Based on patch by: Christine Caulfield <ccaulfie@redhat.com>

Reported-by: dann frazier <dann.frazier@canonical.com>
Reviewed-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1b189b88 21-Mar-2012 David Teigland <teigland@redhat.com>

dlm: last element of dlm_local_addr[] never used

The last element of dlm_local_addr[DLM_MAX_ADDR_COUNT]
was not used because the loop ended at COUNT - 1.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Teigland <teigland@redhat.com>
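
The off-by-one described in 1b189b88 can be sketched in user space as follows. The array size and function names are hypothetical stand-ins; only the loop bound mirrors the commit description.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ADDR_COUNT 3	/* hypothetical stand-in for DLM_MAX_ADDR_COUNT */

/* Buggy bound: stopping at COUNT - 1 never touches the last slot. */
static int fill_buggy(int slots[MAX_ADDR_COUNT])
{
	int i;

	assert(slots != NULL);
	for (i = 0; i < MAX_ADDR_COUNT - 1; i++)
		slots[i] = 1;
	return i;	/* number of slots written */
}

/* Fixed bound: every slot, including the last, is reachable. */
static int fill_fixed(int slots[MAX_ADDR_COUNT])
{
	int i;

	assert(slots != NULL);
	for (i = 0; i < MAX_ADDR_COUNT; i++)
		slots[i] = 1;
	return i;
}
```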


# 2f2d76cc 07-Mar-2012 Benjamin Poirier <bpoirier@suse.de>

dlm: Do not allocate a fd for peeloff

avoids allocating a fd that a) propagates to every kernel thread and
usermodehelper b) is not properly released.

References: http://article.gmane.org/gmane.linux.network.drbd/22529
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4e3fd7a0 20-Nov-2011 Alexey Dobriyan <adobriyan@gmail.com>

net: remove ipv6_addr_copy()

C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bcaadf5c 03-Jul-2011 Masatake YAMATO <yamato@redhat.com>

dlm: dump address of unknown node

When the dlm fails to make a network connection to another
node, include the address of the node in the error message.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 25985edc 30-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# e43f055a 10-Mar-2011 David Teigland <teigland@redhat.com>

dlm: use alloc_workqueue function

Replaces deprecated create_singlethread_workqueue().

Signed-off-by: David Teigland <teigland@redhat.com>


# 6b155c8f 11-Feb-2011 David Teigland <teigland@redhat.com>

dlm: use single thread workqueues

The recent commit to use cmwq for send and recv threads
dcce240ead802d42b1e45ad2fcb2ed4a399cb255 introduced problems,
apparently due to multiple workqueue threads. Single threads
make the problems go away, so return to that until we fully
understand the concurrency issues with multiple threads.

Signed-off-by: David Teigland <teigland@redhat.com>


# b9d41052 13-Dec-2010 Namhyung Kim <namhyung@gmail.com>

dlm: sanitize work_start() in lowcomms.c

create_workqueue() returns NULL on failure rather than an ERR_PTR().
Fix error checking and remove unnecessary variable 'error'.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
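
Why an IS_ERR()-style check misses a NULL return, as fixed in b9d41052, can be sketched in user space. `is_err` below is a hypothetical imitation of the kernel's error-pointer test (the top 4095 addresses encode negative errno values); it is not the kernel macro itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* same range the kernel reserves for error pointers */

/* True only for pointers in the last MAX_ERRNO bytes of the address space. */
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Correct check for an allocator that signals failure with NULL. */
static int alloc_failed(const void *ptr)
{
	assert(!is_err(ptr));	/* this style of API never returns error pointers */
	return ptr == NULL;
}
```

Because NULL is outside the error-pointer range, code that only tests `is_err()` on such a return value would treat a failed allocation as success.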


# f92c8dd7 12-Nov-2010 Bob Peterson <rpeterso@redhat.com>

dlm: reduce cond_resched during send

Calling cond_resched() after every send can unnecessarily
degrade performance. Go back to an old method of scheduling
after 25 messages.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
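
The "yield after 25 messages" pattern from f92c8dd7 can be sketched in user space like this. `sched_yield()` stands in for cond_resched(), and the constant name and `send_batch` are assumptions made for illustration.

```c
#include <assert.h>
#include <sched.h>

#define MAX_SEND_MSG_COUNT 25	/* assumed name for the threshold in the commit */

/* Process n messages, yielding the CPU once per MAX_SEND_MSG_COUNT sends.
 * Returns how many times the loop yielded. */
static int send_batch(int n)
{
	int count = 0;
	int yields = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		/* ... one message would be sent here ... */
		if (++count >= MAX_SEND_MSG_COUNT) {
			sched_yield();	/* user-space stand-in for cond_resched() */
			count = 0;
			yields++;
		}
	}
	return yields;
}
```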


# cb2d45da 12-Nov-2010 David Teigland <teigland@redhat.com>

dlm: use TCP_NODELAY

Nagling doesn't help and can sometimes hurt dlm comms.

Signed-off-by: David Teigland <teigland@redhat.com>
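
For reference, disabling Nagle's algorithm as cb2d45da describes looks like this from user space. This sketch uses the standard sockets API rather than the in-kernel call the patch itself would use; `disable_nagle` and `nagle_disabled` are hypothetical helper names.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set TCP_NODELAY on a TCP socket. Returns 0 on success, -1 on error. */
static int disable_nagle(int fd)
{
	int one = 1;

	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
static int nagle_disabled(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```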


# dcce240e 11-Nov-2010 Steven Whitehouse <swhiteho@redhat.com>

dlm: Use cmwq for send and receive workqueues

So far as I can tell, there is no reason to use a single-threaded
send workqueue for dlm, since it may need to send to several sockets
concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
latency sensitive (and to avoid a priority inversion wrt GFS2's
glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
that (even though with current cluster infrastructure, it doesn't
make sense as the node will most likely land up ejected from the
cluster) in the future.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>


# b36930dd 10-Nov-2010 David Miller <davem@davemloft.net>

dlm: Handle application limited situations properly.

In the normal regime where an application uses non-blocking I/O
writes on a socket, they will handle -EAGAIN and use poll() to
wait for send space.

They don't actually sleep on the socket I/O write.

But kernel level RPC layers that do socket I/O operations directly
and key off of -EAGAIN on the write() to "try again later" don't
use poll(), they instead have their own sleeping mechanism and
rely upon ->sk_write_space() to trigger the wakeup.

So they do effectively sleep on the write(), but this mechanism
alone does not let the socket layers know what's going on.

Therefore they must emulate what would have happened, otherwise
TCP cannot possibly see that the connection is application window
size limited.

Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
bumping the ->sk_write_count as needed when we hit the send buffer
limits.

This should make TCP send buffer size auto-tuning and the
->sk_write_space() callback invocations actually happen.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Teigland <teigland@redhat.com>


# f70cb33b 03-Aug-2010 Julia Lawall <julia@diku.dk>

fs/dlm: Drop unnecessary null test

hlist_for_each_entry binds its first argument to a non-null value, and thus
any null test on the value of that argument is superfluous.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
iterator I;
expression x,E,E1,E2;
statement S,S1,S2;
@@

I(x,...) { <...
- (x != NULL) &&
E
...> }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 573c24c4 30-Nov-2009 David Teigland <teigland@redhat.com>

dlm: always use GFP_NOFS

Replace all GFP_KERNEL and ls_allocation with GFP_NOFS.
ls_allocation would be GFP_KERNEL for userland lockspaces
and GFP_NOFS for file system lockspaces.

It was discovered that any lockspaces on the system can
affect all others by triggering memory reclaim in the
file system which could in turn call back into the dlm
to acquire locks, deadlocking dlm threads that were
shared by all lockspaces, like dlm_recv.

Signed-off-by: David Teigland <teigland@redhat.com>


# 6861f350 24-Sep-2009 David Teigland <teigland@redhat.com>

dlm: fix socket fd translation

The code to set up sctp sockets was not using the sockfd_lookup()
and sockfd_put() routines to translate an fd to a socket. The
direct fget and fput calls were resulting in error messages from
alloc_fd().

Also clean up two log messages and remove a third, related to
setting up sctp associations.

Signed-off-by: David Teigland <teigland@redhat.com>


# 04bedd79 18-Sep-2009 David Teigland <teigland@redhat.com>

dlm: fix lowcomms_connect_node for sctp

The recently added dlm_lowcomms_connect_node() from
391fbdc5d527149578490db2f1619951d91f3561 does not work
when using SCTP instead of TCP. The sctp connection code
has nothing to do without data to send. Check for no data
in the sctp connection code and do nothing instead of
triggering a BUG. Also have connect_node() do nothing
when the protocol is sctp.

Signed-off-by: David Teigland <teigland@redhat.com>


# 1329e3f2 24-Aug-2009 Paolo Bonzini <bonzini@gnu.org>

dlm: use kernel_sendpage

Using kernel_sendpage() is cleaner and safer than following
sock->ops ourselves.

Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
Signed-off-by: David Teigland <teigland@redhat.com>


# 063c4c99 11-Aug-2009 Lars Marowsky-Bree <lmb@suse.de>

dlm: fix connection close handling

Closing a connection to a node can create problems if there are
outstanding messages for that node. The problems include dlm_send
spinning attempting to reconnect, or BUG from tcp_connect_to_sock()
attempting to use a partially closed connection.

To cleanly close a connection, we now first attempt to send any pending
messages, cancel any remaining workqueue work, and flag the connection
as closed to avoid reconnect attempts.

Signed-off-by: Lars Marowsky-Bree <lmb@suse.de>
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# b5711b8e 27-Jul-2009 Casey Dahlin <cdahlin@redhat.com>

dlm: fix double-release of socket in error exit path

The last correction to the tcp_connect_to_sock error exit path,
commit a89d63a159b1ba5833be2bef00adf8ad8caac8be, can free an already
freed socket, due to collision with a previous (incomplete) attempt
to fix the same issue, commit 311f6fc77c51926dbdfbeab0a5d88d70f01fa3f4.

Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# a89d63a1 13-Jul-2009 Casey Dahlin <cdahlin@redhat.com>

dlm: free socket in error exit path

In the tcp_connect_to_sock() error exit path, the socket
allocated at the top of the function was not being freed.

Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 748285cc 15-May-2009 David Teigland <teigland@redhat.com>

dlm: use more NOFS allocation

Change some GFP_KERNEL allocations to use either GFP_NOFS or
ls_allocation (when available) which the fs sets to GFP_NOFS.
The point is to prevent allocations from going back into the
cluster fs in places where that might lead to deadlock.

Signed-off-by: David Teigland <teigland@redhat.com>


# 391fbdc5 07-May-2009 Christine Caulfield <ccaulfie@redhat.com>

dlm: connect to nodes earlier

Make network connections to other nodes earlier, in the context of
dlm_recoverd. This avoids connecting to nodes from dlm_send where we
try to avoid allocations which could possibly deadlock if memory reclaim
goes into the cluster fs which may try to do a dlm operation.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 5e9ccc37 27-Jan-2009 Christine Caulfield <ccaulfie@redhat.com>

dlm: replace idr with hash table for connections

Integer nodeids can be too large for the idr code; use a hash
table instead.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
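
A nodeid-keyed hash table of the kind 5e9ccc37 describes can be sketched in user space as follows. The table size, `struct conn` and the helper names are assumptions made for illustration; the point is only that arbitrarily large integer nodeids reduce to a small bucket index.

```c
#include <assert.h>
#include <stdlib.h>

#define CONN_HASH_SIZE 32	/* assumed size; must be a power of two */
static_assert((CONN_HASH_SIZE & (CONN_HASH_SIZE - 1)) == 0,
	      "masking only works for power-of-two table sizes");

struct conn {
	int nodeid;
	struct conn *next;	/* chain of connections in the same bucket */
};

static struct conn *conn_hash[CONN_HASH_SIZE];

/* Any nodeid, however large, maps to a bucket in [0, CONN_HASH_SIZE). */
static unsigned int nodeid_hash(int nodeid)
{
	return (unsigned int)nodeid & (CONN_HASH_SIZE - 1);
}

/* Look up an existing connection; NULL if none is recorded. */
static struct conn *conn_find(int nodeid)
{
	struct conn *c;

	for (c = conn_hash[nodeid_hash(nodeid)]; c; c = c->next)
		if (c->nodeid == nodeid)
			return c;
	return NULL;
}

/* Find the connection for nodeid, creating it on first use. */
static struct conn *conn_get(int nodeid)
{
	struct conn *c = conn_find(nodeid);
	unsigned int h = nodeid_hash(nodeid);

	if (c)
		return c;
	c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;
	c->nodeid = nodeid;
	c->next = conn_hash[h];
	conn_hash[h] = c;
	return c;
}
```

Two nodeids that collide in one bucket are kept apart by the chain walk in `conn_find`, which compares the full nodeid.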


# 2cf12c0b 22-Jan-2009 Joe Perches <joe@perches.com>

dlm: comment typo fixes

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 44ad532b 22-Jan-2009 Joe Perches <joe@perches.com>

dlm: use ipv6_addr_copy

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 1521848c 12-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

dlm: remove kmap/kunmap

The pages used in lowcomms are not highmem, so kmap is not necessary.

Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# d6d7b702 12-Nov-2008 Steven Whitehouse <swhiteho@redhat.com>

dlm: fix up memory allocation flags

Use ls_allocation for memory allocations, which a cluster fs sets to
GFP_NOFS. Use GFP_NOFS for allocations when no lockspace struct is
available. Taking dlm locks needs to avoid calling back into the
cluster fs because write-out can require taking dlm locks.

Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 311f6fc7 27-Jun-2008 Masatake YAMATO <yamato@redhat.com>

dlm: release socket on error

It seems that `sock' allocated by sock_create_kern in
tcp_connect_to_sock() of fs/dlm/lowcomms.c is not released if
dlm_nodeid_to_addr returns an error.

Acked-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 0035a4b1 11-May-2008 Marcin Slusarz <marcin.slusarz@gmail.com>

dlm: tcp_connect_to_sock should check for -EINVAL, not EINVAL

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: David Teigland <teigland@redhat.com>


# 7a936ce7 12-May-2008 Matthias Kaehlcke <matthias@kaehlcke.net>

dlm: convert connections_lock in a mutex

The semaphore connections_lock is used as a mutex. Convert it to the mutex
API.

Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>


# 39bd4177 09-Jan-2008 Patrick Caulfeld <pcaulfie@redhat.com>

dlm: close othercons

This patch addresses a problem introduced with the last round of
lowcomms patches where the 'othercon' connections do not get freed when
the DLM shuts down.

This results in the error message
"slab error in kmem_cache_destroy(): cache `dlm_conn': Can't free all
objects"

and the DLM cannot be restarted without a system reboot.

See bz#428119

Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# 6bd8feda 25-Oct-2007 Lon Hohberger <lhh@redhat.com>

dlm: bind connections from known local address when using TCP

A common problem occurs when multiple IP addresses within the same
subnet are assigned to the same NIC. If we make a connection attempt to
another address on the same subnet as one of those addresses, the
connection attempt will not necessarily be routed from the address we
want.

In the case of the DLM, the other nodes will quickly drop the connection
attempt, causing problems.

This patch makes the DLM bind to the local address it acquired from the
cluster manager when using TCP prior to making a connection, obviating
the need for administrators to "fix" their systems or use clever routing
tricks.

Signed-off-by: Lon Hohberger <lhh@redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>


# df61c952 07-Nov-2007 David S. Miller <davem@sunset.davemloft.net>

[DLM] lowcomms: Do not muck with sysctl_rmem_max.

Use SO_RCVBUFFORCE instead.

Signed-off-by: David S. Miller <davem@davemloft.net>


# d66f8277 14-Sep-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Make dlm_sendd cond_resched more

Under high recovery loads dlm_sendd can monopolise the CPU and cause soft lockups.

This one extra and one moved cond_resched() make it yield a little more during
such times keeping work moving.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 61d96be0 20-Aug-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Fix lowcomms socket closing

This patch fixes the slight mess made in lowcomms closing by previous patches
and fixes all sorts of DLM hangs.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 9e5f2825 02-Aug-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] More othercon fixes

The last patch to clean out 'othercon' structures only fixed half the problem.
The attached addresses the other situations too, and fixes bz#238490

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 01c8cab2 17-Jul-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] zero unused parts of sockaddr_storage

When we build a sockaddr_storage for an IP address, clear the unused parts as
they could be used for node comparisons.

I have seen this occasionally make sctp connections fail.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
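
The reason zeroing matters in 01c8cab2 can be sketched in user space: if addresses are compared bytewise, leftover bytes beyond the populated fields make two equal addresses look different. `build_addr` is a hypothetical helper, and IPv4 is used only to keep the sketch short.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Build an IPv4 address inside a sockaddr_storage, clearing every unused
 * byte first so that a later memcmp() compares only meaningful data.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad string. */
static int build_addr(struct sockaddr_storage *ss, const char *ip)
{
	struct sockaddr_in *in4 = (struct sockaddr_in *)ss;

	assert(ss != NULL && ip != NULL);
	memset(ss, 0, sizeof(*ss));	/* the step the commit adds */
	in4->sin_family = AF_INET;
	return inet_pton(AF_INET, ip, &in4->sin_addr) == 1 ? 0 : -1;
}
```

With the memset in place, two buffers that started with different garbage compare equal once they hold the same address.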


# 25720c2d 11-Jul-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Clear othercon pointers when a connection is closed

This patch clears the othercon pointer and frees the memory when a connection
is closed. Previously this could cause a small memory leak when nodes leave the cluster.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 20c2df83 19-Jul-2007 Paul Mundt <lethal@linux-sh.org>

mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>


# f4fadb23c 27-Jun-2007 Andrew Morton <akpm@linux-foundation.org>

[GFS2] git-gfs2-nmw-build-fix

Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 97d84836 27-Jun-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Telnet to port 21064 can stop all lockspaces

This patch fixes Red Hat bz#245892

Opening a tcp connection from a cluster member to another cluster member
targeting the dlm port is enough to stop every dlm operation in the cluster.
This means that GFS and rgmanager will hang.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# afb853fb 01-Jun-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] fix socket shutdown

This patch clears the user_data of active sockets as part of cleanup.
This prevents any late-arriving data from trying to add jobs to the work
queue while we are tidying up.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-Off-By: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 617e82e1 26-Apr-2007 David Teigland <teigland@redhat.com>

[DLM] lowcomms style

Replace some printk with log_print, and fix some simple cases of lines
over 80. Also, return -ENOTCONN if lowcomms_start fails due to no local
IP address being available.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 30d3a237 23-Apr-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Lowcomms nodeid range & initialisation fixes

Fix a few range & initialization bugs in lowcomms.
- max_nodeid is really the highest nodeid encountered, so all loops must include
it in their iterations.
- clean dlm_local_count & connection_idr so we can do a clean restart.
- Remove a spurious BUG_ON

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 2439fe50 19-Apr-2007 Josef Bacik <jwhiter@redhat.com>

[DLM] Fix dlm_lowcoms_stop hang

When you attempt to release a lockspace in DLM, it will hang trying to down a
semaphore that has already been downed. The attached patch fixes the problem.

Signed-off-by: Josef Bacik <jwhiter@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Patrick Caulfield <pcaulfie@redhat.com>


# 6ed7257b 17-Apr-2007 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] Consolidate transport protocols

This patch consolidates the TCP & SCTP protocols for the DLM into a single file
and makes it switchable at run-time (well, at least before the DLM actually
starts up!)

For RHEL5 this patch requires Neil Horman's patch that expands the in-kernel
socket API but that has already been twice ACKed so it should be OK.

The patch adds a new lowcomms.c file that replaces the existing lowcomms-sctp.c
& lowcomms-tcp.c files.

Signed-off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 42fb0083 13-Oct-2006 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] fix iovec length in recvmsg

I didn't spot that the msg_iovlen was set to 2 if there
were two elements in the iovec but left at zero if not :(

I think this might be why bob was still seeing trouble.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 4c5e1b1a 12-Oct-2006 Patrick Caulfield <pcaulfie@redhat.com>

[DLM] fix iovec length in recvmsg

The DLM always passes the iovec length as 1, this is wrong when the circular
buffer wraps round.

Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
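
The iovec count issue behind 4c5e1b1a (and its follow-up 42fb0083) can be sketched in user space. When the free region of a circular buffer wraps past the end, it takes two iovec entries to describe it, and the count passed along with the array must match. `free_space_iov` and its parameters are hypothetical; this is not the layout of the DLM's own buffer structure.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Describe the free space of a circular buffer as one or two iovecs.
 * buf/size: backing storage; head: index of the first data byte;
 * used: number of data bytes. Returns the number of iovecs filled in
 * (0 if the buffer is full), which is the value to use as msg_iovlen. */
static int free_space_iov(char *buf, size_t size, size_t head, size_t used,
			  struct iovec iov[2])
{
	assert(buf != NULL && size > 0 && head < size && used <= size);

	size_t tail = (head + used) % size;	/* first free byte */
	size_t free_len = size - used;
	size_t to_end = size - tail;		/* contiguous bytes up to the end */

	if (free_len == 0)
		return 0;
	if (free_len <= to_end) {
		/* Free space is contiguous: one segment is enough. */
		iov[0].iov_base = buf + tail;
		iov[0].iov_len = free_len;
		return 1;
	}
	/* Free space wraps: tail..end, then start..head. */
	iov[0].iov_base = buf + tail;
	iov[0].iov_len = to_end;
	iov[1].iov_base = buf;
	iov[1].iov_len = free_len - to_end;
	return 2;
}
```

Passing a fixed count of 1 in the wrapped case would silently ignore the second segment; leaving the count at 0 in the contiguous case would describe no buffer at all.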


# 38d6fd26 09-Oct-2006 Al Viro <viro@ftp.linux.org.uk>

[PATCH] dlm gfp_t annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# fcc8abc8 10-Aug-2006 David Teigland <teigland@redhat.com>

[DLM] move kmap to after spin_unlock

Doing the kmap() while holding the spinlock was causing recursive spinlock
problems. It seems the kmap was scheduling, although there was no warning
as I'd expect. Patrick, do we need locking around the kmap?

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 7d5513d5 19-Jun-2006 David Teigland <teigland@redhat.com>

[DLM] init rwsem earlier

The nodeinfo_lock rwsem needs to be initialized when the module is loaded
instead of when the dlm is first used.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 47c96298 25-May-2006 Steven Whitehouse <swhiteho@redhat.com>

[GFS2] Change name due to local_nodeid being a macro

Change names of local_nodeid to dlm_local_nodeid to prevent a
namespace collision. Changed other local variable to match.

Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# 1c032c03 28-Apr-2006 David Teigland <teigland@redhat.com>

[DLM] PATCH 2/3 dlm: lowcomms close

When a node is removed from a lockspace configuration, close our
connection to it, clearing any remaining messages for it.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>


# e7fd4179 18-Jan-2006 David Teigland <teigland@redhat.com>

[DLM] The core of the DLM for GFS2/CLVM

This is the core of the distributed lock manager which is required
to use GFS2 as a cluster filesystem. It is also used by CLVM and
can be used as a standalone lock manager independently of either
of these two projects.

It implements VAX-style locking modes.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>