History log of /linux-master/net/ceph/messenger.c
Revision Date Author Comments
# 7563cf17 04-Oct-2023 Jordan Rife <jrife@google.com>

libceph: use kernel_connect()

Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().

This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dee0c5f8 01-Jul-2022 Jeff Layton <jlayton@kernel.org>

libceph: add new iov_iter-based ceph_msg_data_type and ceph_osd_data_type

Add an iov_iter to the unions in ceph_msg_data and ceph_msg_data_cursor.
Instead of requiring a list of pages or bvecs, we can just use an
iov_iter directly, and avoid extra allocations.

We assume that the pages represented by the iter are pinned such that
they shouldn't incur page faults, which is the case for the iov_iters
created by netfs.

While working on this, Al Viro informed me that he was going to change
iov_iter_get_pages to auto-advance the iterator as that pattern is more
or less required for ITER_PIPE anyway. We emulate that here for now by
advancing in the _next op and tracking that amount in the "lastlen"
field.

In the event that _next is called twice without an intervening
_advance, we revert the iov_iter by the remaining lastlen before
calling iov_iter_get_pages.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ec3bc567 25-Jan-2022 Jeff Layton <jlayton@kernel.org>

libceph: new sparse_read op, support sparse reads on msgr2 crc codepath

Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation use it to do special
handling for incoming reads.

The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.

This necessitates adding some new field to some other structs:

- The msg gets a new bool to track whether it's a sparse_read request.

- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data

- Handing a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip that much on a revoke. We may want to expand the use of
that to the normal read path as well, just for consistency's sake.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8ff2c64c 08-Jul-2023 Ilya Dryomov <idryomov@gmail.com>

rbd: harden get_lock_owner_info() a bit

- we want the exclusive lock type, so test for it directly
- use sscanf() to actually parse the lock cookie and avoid admitting
invalid handles
- bail if locker has a blank address

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>


# 40e0b090 19-Jan-2023 Peilin Ye <peilin.ye@bytedance.com>

net/sock: Introduce trace_sk_data_ready()

As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations. For example:

<...>
iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 98123866 16-Dec-2022 Benjamin Coddington <bcodding@redhat.com>

Treewide: Stop corrupting socket's task_frag

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag. The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code. I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false. Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# da4ab869 25-May-2022 Jeff Layton <jlayton@kernel.org>

libceph: drop last_piece flag from ceph_msg_data_cursor

ceph_msg_data_next is always passed a NULL pointer for this field. Some
of the "next" operations look at it in order to determine the length,
but we can just take the min of the data on the page or cursor->resid.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 038b8d1d 30-Dec-2021 Ilya Dryomov <idryomov@gmail.com>

libceph: optionally use bounce buffer on recv path in crc mode

Both msgr1 and msgr2 in crc mode are zero copy in the sense that
message data is read from the socket directly into the destination
buffer. We assume that the destination buffer is stable (i.e. remains
unchanged while it is being read to) though. Otherwise, CRC errors
ensue:

libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706
libceph: osd1 (1)192.168.122.1:6843 bad crc/signature

libceph: bad data crc, calculated 57958023, expected 1805382778
libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc

Introduce rxbounce option to enable use of a bounce buffer when
receiving message data. In particular this is needed if a mapped
image is a Windows VM disk, passed to QEMU. Windows has a system-wide
"dummy" page that may be mapped into the destination buffer (potentially
more than once into the same buffer) by the Windows Memory Manager in
an effort to generate a single large I/O [1][2]. QEMU makes a point of
preserving overlap relationships when cloning I/O vectors, so krbd gets
exposed to this behaviour.

[1] "What Is Really in That MDL?"
https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)
[2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/

URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# 2d7c86a8 14-Jul-2021 Venky Shankar <vshankar@redhat.com>

libceph: generalize addr/ip parsing based on delimiter

... and remove hardcoded function name in ceph_parse_ips().

[ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a421ef30 14-Jan-2022 Michal Hocko <mhocko@suse.com>

mm: allow !GFP_KERNEL allocations for kvmalloc

Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
previous patches so we can allow the support for kvmalloc. This will
allow some external users to simplify or completely remove their
helpers.

GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.

ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.

Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cd1a677c 19-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph, ceph: implement msgr2.1 protocol (crc and secure modes)

Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5. msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a56dd9bf 12-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: move msgr1 protocol specific fields to its own struct

A couple whitespace fixups, no functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2f713615 12-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: move msgr1 protocol implementation to its own file

A pure move, no other changes.

Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved. msgr2 will bring its own, more efficient,
variants based on iov_iter. Switching msgr1 to them was considered
but decided against to avoid subtle regressions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 566050e1 11-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: separate msgr1 protocol implementation

In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops. The semantics are exactly the same as they are
now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6503e0b6 09-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: export remaining protocol independent infrastructure

In preparation for msgr2, make all protocol independent functions
in messenger.c global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 699921d9 09-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: export zero_page

In preparation for msgr2, make zero_page global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3fefd43e 09-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: rename and export con->flags bits

In preparation for msgr2, move the defines to the header file.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6d7f62bf 09-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: rename and export con->state states

In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.

Also drop state transition comments. They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 30be780a 09-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: make con->state an int

unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc. Save a bit of memory.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2f687380 05-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: don't export ceph_messenger_{init_fini}() to modules

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fd1a154c 05-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: make sure our addr->port is zero and addr->nonce is non-zero

Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port. However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved. Zero it.

Conversely, make sure our addr->nonce is non-zero. A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 771294fe 18-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: factor out ceph_con_get_out_msg()

Move the logic of grabbing the next message from the queue into its own
function. Like ceph_con_in_msg_alloc(), this is protocol independent.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fc4c128e 16-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: change ceph_con_in_msg_alloc() to take hdr

ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific. While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8ee8abf7 04-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: change ceph_msg_data_cursor_init() to take cursor

Make it possible to have local cursors and embed them outside struct
ceph_msg.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 02471928 13-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: handle discarding acked and requeued messages separately

Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5cd8da3a 13-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: drop msg->ack_stamp field

It is set in process_ack() but never used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d3c1248c 11-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: remove redundant session reset log message

Stick with pr_info message because session reset isn't an error most of
the time. When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a3da057b 11-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: clear con->peer_global_seq on RESETSESSION

con->peer_global_seq is part of session state. Clear it when
the server tells us to reset, not just in ceph_con_close().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5963c3d0 06-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: rename reset_connection() to ceph_con_reset_session()

With just session reset bits left, rename appropriately.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3596f4c1 06-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: split protocol reset bits out of reset_connection()

Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.

Note that con->out_skip is now reset on faults. This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 90b6561a 06-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: don't call reset_connection() on version/feature mismatches

A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.

This changed a long time ago, see commits 6384bb8b8e88 ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc600bc ("libceph: fix protocol
feature mismatch failure path"). Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced). reset_connection() calls
there are rather confusing because reset_connection() resets a session
together an individual instance of the protocol. This is cleaned up
in the next patch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 418af5b3 29-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: lower exponential backoff delay

The current setting allows the backoff to climb up to 5 minutes. This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.

In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the
same.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b77f8f0e 05-Nov-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: include middle_len in process_message() dout

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 28e1581c 07-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: clear con->out_msg on Policy::stateful_server faults

con->out_msg must be cleared on Policy::stateful_server
(!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
reconnection attempt, because after writing the banner the
messenger moves on to writing the data section of that message
(either from where it got interrupted by the connection reset or
from the beginning) instead of writing struct ceph_msg_connect.
This results in a bizarre error message because the server
sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
ceph_msg_connect:

libceph: mds0 (1)172.21.15.45:6828 socket error on write
ceph: mds0 reconnect start
libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

AFAICT this bug goes back to the dawn of the kernel client.
The reason it survived for so long is that only MDS sessions
are stateful and only two MDS messages have a data section:
CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
The connection has to get reset precisely when such message
is being sent -- in this case it was the former.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/47723
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# a9dfe31e 03-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: format ceph_entity_addr nonces as unsigned

Match the server side logs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5a5036c8 02-Oct-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: move a dout in queue_con_delay()

The queued con->work can start executing (and therefore logging)
before we get to this "con->work has been queued" message, making
the logs confusing. Move it up, with the meaning of "con->work
is about to be queued".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 40efc4dc 02-Oct-2020 Coly Li <colyli@suse.de>

libceph: use sendpage_ok() in ceph_tcp_sendpage()

In libceph, ceph_tcp_sendpage() does the following checks before handle
the page by network layer's zero copy sendpage method,
if (page_count(page) >= 1 && !PageSlab(page))

This check is exactly what sendpage_ok() does. This patch replace the
open coded checks by sendpage_ok() as a code cleanup.

Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 12abc5ee 27-May-2020 Christoph Hellwig <hch@lst.de>

tcp: add tcp_sock_set_nodelay

Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e8862740 10-Mar-2020 Ilya Dryomov <idryomov@gmail.com>

libceph: fix alloc_msg_with_page_vector() memory leaks

Make it so that CEPH_MSG_DATA_PAGES data item can own pages,
fixing a bunch of memory leaks for a page vector allocated in
alloc_msg_with_page_vector(). Currently, only watch-notify
messages trigger this allocation, and normally the page vector
is freed either in handle_watch_notify() or by the caller of
ceph_osdc_notify(). But if the message is freed before that
(e.g. if the session faults while reading in the message or
if the notify is stale), we leak the page vector.

This was supposed to be fixed by switching to a message-owned
pagelist, but that never happened.

Fixes: 1907920324f1 ("libceph: support for sending notifies")
Reported-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>


# 82995cc6 25-Mar-2019 David Howells <dhowells@redhat.com>

libceph, rbd, ceph: convert to use the new mount API

Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 120a75ea 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

libceph: add function that reset client's entity addr

This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2c66de56 17-Jun-2019 Jeff Layton <jlayton@kernel.org>

libceph: rename ceph_encode_addr to ceph_encode_banner_addr

...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d3c3c0a8 17-Jun-2019 Jeff Layton <jlayton@kernel.org>

libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE

Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bc07532c 04-Jun-2019 Jeff Layton <jlayton@kernel.org>

libceph: fix sa_family just after reading address

It doesn't make sense to leave it undecoded until later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a58946c1 26-Jun-2019 David Howells <dhowells@redhat.com>

keys: Pass the network namespace into request_key mechanism

Create a request_key_net() function and use it to pass the network
namespace domain tag into DNS revolver keys and rxrpc/AFS keys so that keys
for different domains can coexist in the same keyring.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-afs@lists.infradead.org


# d0660f0b 03-May-2019 David Howells <dhowells@redhat.com>

dns_resolver: Allow used keys to be invalidated

Allow used DNS resolver keys to be invalidated after use if the caller is
doing its own caching of the results. This reduces the amount of resources
required.

Fix AFS to invalidate DNS results to kill off permanent failure records
that get lodged in the resolver keyring and prevent future lookups from
happening.

Fixes: 0a5143f2f89c ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>


# b726ec97 06-May-2019 Jeff Layton <jlayton@kernel.org>

libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer

GCC9 is throwing a lot of warnings about unaligned accesses by
callers of ceph_pr_addr. All of the current callers are passing a
pointer to the sockaddr inside struct ceph_entity_addr.

Fix it to take a pointer to a struct ceph_entity_addr instead,
and then have the function make a copy of the sockaddr before
printing it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cede185b 06-May-2019 Jeff Layton <jlayton@kernel.org>

libceph: fix unaligned accesses in ceph_entity_addr handling

GCC9 is throwing a lot of warnings about unaligned access. This patch
fixes some of them by changing most of the sockaddr handling functions
to take a pointer to struct ceph_entity_addr instead of struct
sockaddr_storage. The lower functions can then make copies or do
unaligned accesses as needed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 187df763 22-Mar-2019 Ilya Dryomov <idryomov@gmail.com>

libceph: fix breakage caused by multipage bvecs

A bvec can now consist of multiple physically contiguous pages.
This means that bvec_iter_advance() can move to a different page while
staying in the same bvec (i.e. ->bi_bvec_done != 0).

The messenger works in terms of segments which can now be defined as
the smaller of a bvec and a page. The "more bytes to process in this
segment" condition holds only if bvec_iter_advance() leaves us in the
same bvec _and_ in the same page. On next bvec (possibly in the same
page) and on next page (possibly in the same bvec) we may need to set
->last_piece.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0fd3fd0a 05-Feb-2019 Ilya Dryomov <idryomov@gmail.com>

libceph: handle an empty authorize reply

The authorize reply can be empty, for example when the ticket used to
build the authorizer is too old and TAG_BADAUTHORIZER is returned from
the service. Calling ->verify_authorizer_reply() results in an attempt
to decrypt and validate (somewhat) random data in au->buf (most likely
the signature block from calc_signature()), which fails and ends up in
con_fault_finish() with !con->auth_retry. The ticket isn't invalidated
and the connection is retried again and again until a new ticket is
obtained from the monitor:

libceph: osd2 192.168.122.1:6809 bad authorize reply
libceph: osd2 192.168.122.1:6809 bad authorize reply
libceph: osd2 192.168.122.1:6809 bad authorize reply
libceph: osd2 192.168.122.1:6809 bad authorize reply

Let TAG_BADAUTHORIZER handler kick in and increment con->auth_retry.

Cc: stable@vger.kernel.org
Fixes: 5c056fdc5b47 ("libceph: verify authorize reply on connect")
Link: https://tracker.ceph.com/issues/20164
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 4aac9228 14-Jan-2019 Ilya Dryomov <idryomov@gmail.com>

libceph: avoid KEEPALIVE_PENDING races in ceph_con_keepalive()

con_fault() can transition the connection into STANDBY right after
ceph_con_keepalive() clears STANDBY in clear_standby():

libceph user thread ceph-msgr worker

ceph_con_keepalive()
mutex_lock(&con->mutex)
clear_standby(con)
mutex_unlock(&con->mutex)
mutex_lock(&con->mutex)
con_fault()
...
if KEEPALIVE_PENDING isn't set
set state to STANDBY
...
mutex_unlock(&con->mutex)
set KEEPALIVE_PENDING
set WRITE_PENDING

This triggers warnings in clear_standby() when either ceph_con_send()
or ceph_con_keepalive() get to clearing STANDBY next time.

I don't see a reason to condition queue_con() call on the previous
value of KEEPALIVE_PENDING, so move the setting of KEEPALIVE_PENDING
into the critical section -- unlike WRITE_PENDING, KEEPALIVE_PENDING
could have been a non-atomic flag.

Reported-by: syzbot+acdeb633f6211ccdf886@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Myungho Jung <mhjungk@gmail.com>


# 87349cda 21-Nov-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: switch more to bool in ceph_tcp_sendmsg()

Unlike in ceph_tcp_sendpage(), it's a bool.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 433b0a12 20-Nov-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: use MSG_SENDPAGE_NOTLAST with ceph_tcp_sendpage()

Prevent do_tcp_sendpages() from calling tcp_push() (at least) once per
page. Instead, arrange for tcp_push() to be called (at least) once per
data payload. This results in more MSS-sized packets and fewer packets
overall (5-10% reduction in my tests with typical OSD request sizes).
See commits 2f5338442425 ("tcp: allow splice() to build full TSO
packets"), 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push()
once") and ae62ca7b0321 ("tcp: fix MSG_SENDPAGE_NOTLAST logic") for
details.

Here is an example of a packet size histogram for 128K OSD requests
(MSS = 1448, top 5):

Before:

SIZE COUNT
1448 777700
952 127915
1200 39238
1219 9806
21 5675

After:

SIZE COUNT
1448 897280
21 6201
1019 2797
643 2739
376 2479

We could do slightly better by explicitly corking the socket but it's
not clear it's worth it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3239eb52 16-Nov-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: use sock_no_sendpage() as a fallback in ceph_tcp_sendpage()

sock_no_sendpage() makes the code cleaner.

Also, don't set MSG_EOR. sendpage doesn't act on MSG_EOR on its own,
it just honors the setting from the preceding sendmsg call by looking
at ->eor in tcp_skb_can_collapse_to().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1f6b821a 13-Nov-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: drop last_piece logic from write_partial_message_data()

last_piece is for the last piece in the current data item, not in the
entire data payload of the message. This is harmful for messages with
multiple data items. On top of that, we don't need to signal the end
of a data payload either because it is always followed by a footer.

We used to signal "more" unconditionally, until commit fe38a2b67bc6
("libceph: start defining message data cursor"). Part of a large
series, it introduced cursor->last_piece and also mistakenly inverted
the hint by passing last_piece for "more". This was corrected with
commit c2cfa1940097 ("libceph: Fix ceph_tcp_sendpage()'s more boolean
usage").

As it is, last_piece is not helping at all: because Nagle algorithm is
disabled, for a simple message with two 512-byte data items we end up
emitting three packets: front + first data item, second data item and
footer. Go back to the original pre-fe38a2b67bc6 behavior -- a single
packet in most cases.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7e241f64 08-Nov-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: fall back to sendmsg for slab pages

skb_can_coalesce() allows coalescing neighboring slab objects into
a single frag:

return page == skb_frag_page(frag) &&
off == frag->page_offset + skb_frag_size(frag);

ceph_tcp_sendpage() can be handed slab pages. One example of this is
XFS: it passes down sector sized slab objects for its metadata I/O. If
the kernel client is co-located on the OSD node, the skb may go through
loopback and pop on the receive side with the exact same set of frags.
When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy
complains because the size exceeds the object's allocated size:

usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes)

Although skb_can_coalesce() could be taught to return false if the
resulting frag would cross a slab object boundary, we already have
a fallback for non-refcounted pages. Utilize it for slab pages too.

Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aa563d7b 19-Oct-2018 David Howells <dhowells@redhat.com>

iov_iter: Separate type from direction and use accessor functions

In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements. This makes it easier to add further
iterator types. Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself. Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>


# 0d9c1ab3 15-Oct-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: preallocate message data items

Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request(). send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

data = ceph_msg_data_create(...);
BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b7e04 ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased. Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 89486833 28-Sep-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()

Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy(). This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init(). I counted several memory leaks on
various error paths.

Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 130f52f2 27-Jul-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: check authorizer reply/challenge length before reading

Avoid scribbling over memory if the received reply/challenge is larger
than the buffer supplied with the authorizer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 6daca13d 27-Jul-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: add authorizer challenge

When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply). This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge). The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# c0f56b48 26-Jul-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: factor out __prepare_write_connect()

Will be used for sending ceph_msg_connect with an updated authorizer,
after the server challenges the initial authorizer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 262614c4 26-Jul-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: store ceph_auth_handshake pointer in ceph_connection

We already copy authorizer_reply_buf and authorizer_reply_buf_len into
ceph_connection. Factoring out __prepare_write_connect() requires two
more: authorizer_buf and authorizer_buf_len. Store the pointer to the
handshake in con->auth rather than piling on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 473bd2d7 13-Jul-2018 Arnd Bergmann <arnd@arndb.de>

libceph: use timespec64 in for keepalive2 and ticket validity

ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts(). Replacing this with timespec64
based interfaces lets us remove that deprecated API.

I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.

The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e5c93883 27-Apr-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: use MSG_TRUNC for discarding received bytes

Avoid a copy into the "skip buffer".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d2935d6f 24-Apr-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: get rid of more_kvec in try_write()

All gotos to "more" are conditioned on con->state == OPEN, but the only
thing "more" does is opening the socket if con->state == PREOPEN. Kill
that label and rename "more_kvec" to "more".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>


# 9c55ad1c 24-Apr-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: validate con->state at the top of try_write()

ceph_con_workfn() validates con->state before calling try_read() and
then try_write(). However, try_read() temporarily releases con->mutex,
notably in process_message() and ceph_con_in_msg_alloc(), opening the
window for ceph_con_close() to sneak in, close the connection and
release con->sock. When try_write() is called on the assumption that
con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock
gets passed to the networking stack:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: selinux_socket_sendmsg+0x5/0x20

Make sure con->state is valid at the top of try_write() and add an
explicit BUG_ON for this, similar to try_read().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/23706
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>


# 57a35dfb 10-Mar-2018 Chengguang Xu <cgxu519@gmx.com>

libceph, ceph: add __init attribution to init funcitons

Add __init attribution to the functions which are called only once
during initiating/registering operations and deleting unnecessary
symbol exports.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 45a267db 22-Jan-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: handle zero-length data items

rbd needs this for null copyups -- if copyup data is all zeroes, we
want to save some I/O and network bandwidth. See rbd_obj_issue_copyup()
in the next commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# b9e281c2 20-Jan-2018 Ilya Dryomov <idryomov@gmail.com>

libceph: introduce BVECS data type

In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for
working with bio_vec array data buffers. The wrappers are trivial, but
make it look similar to ceph_bio_iter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5359a17d 20-Jan-2018 Ilya Dryomov <idryomov@gmail.com>

libceph, rbd: new bio handling code (aka don't clone bios)

The reason we clone bios is to be able to give each object request
(and consequently each ceph_osd_data/ceph_msg_data item) its own
pointer to a (list of) bio(s). The messenger then initializes its
cursor with cloned bio's ->bi_iter, so it knows where to start reading
from/writing to. That's all the cloned bios are used for: to determine
each object request's starting position in the provided data buffer.

Introduce ceph_bio_iter to do exactly that -- store position within bio
list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 18370b36 14-Oct-2017 Gustavo A. R. Silva <garsilva@embeddedor.com>

ceph: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
[idryomov@gmail.com: amended "Older OSDs" comment]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4690faf0 26-Jul-2017 Ilya Dryomov <idryomov@gmail.com>

libceph: don't call ->reencode_message() more than once per message

Reencoding an already reencoded message is a bad idea. This could
happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY),
such as MDS sessions.

This didn't pop up in testing because currently only OSD requests are
reencoded and OSD sessions are always lossy.

Fixes: 98ad5ebd1505 ("libceph: ceph_connection_operations::reencode_message() method")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>


# 7c40b22f 17-Jul-2017 Dan Carpenter <dan.carpenter@oracle.com>

libceph: potential NULL dereference in ceph_msg_data_create()

If kmem_cache_zalloc() returns NULL then the INIT_LIST_HEAD(&data->links);
will Oops. The callers aren't really prepared for NULL returns so it
doesn't make a lot of difference in real life.

Fixes: 5240d9f95dfe ("libceph: replace message data pointer with list")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 98ad5ebd 15-Jun-2017 Ilya Dryomov <idryomov@gmail.com>

libceph: ceph_connection_operations::reencode_message() method

Give upper layers a chance to reencode the message after the connection
is negotiated and ->peer_features is set. OSD client will use this to
support both luminous and pre-luminous OSDs (in a single cluster): the
former need MOSDOp v8; the latter will continue to be sent MOSDOp v4.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dcbbd97c 05-Jun-2017 Ilya Dryomov <idryomov@gmail.com>

libceph: remove ceph_sanitize_features() workaround

Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0a2ad541 05-May-2017 Yan, Zheng <zyan@redhat.com>

libceph: cleanup old messages according to reconnect seq

when reopen a connection, use 'reconnect seq' to clean up
messages that have already been received by peer.

Link: http://tracker.ceph.com/issues/18690
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1759f7b0 19-May-2017 Ilya Dryomov <idryomov@gmail.com>

libceph: make ceph_msg_data_advance() return void

Both callers ignore the returned bool.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# 1134e091 08-May-2017 Deepa Dinamani <deepa.kernel@gmail.com>

fs: ceph: CURRENT_TIME with ktime_get_real_ts()

CURRENT_TIME is not y2038 safe. The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.

struct timespec is also not y2038 safe. Retain timespec for timestamp
representation here as ceph uses it internally everywhere. These
references will be changed to use struct timespec64 in a separate patch.

The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.

Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().

Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.

This api will be transitioned to be y2038 safe along with vfs.

Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M: Ilya Dryomov <idryomov@gmail.com>
M: "Yan, Zheng" <zyan@redhat.com>
M: Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 633ee407 21-Mar-2017 Ilya Dryomov <idryomov@gmail.com>

libceph: force GFP_NOIO for socket allocations

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

Workqueue: ceph-msgr con_work [libceph]
ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
[<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
[<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[<ffffffff81086335>] flush_work+0x165/0x250
[<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
[<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
[<ffffffff816d6b42>] ? __slab_free+0xee/0x234
[<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
[<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
[<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
[<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
[<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[<ffffffff811c0c18>] super_cache_scan+0x178/0x180
[<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
[<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
[<ffffffff8115af70>] shrink_slab+0x100/0x140
[<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
[<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
[<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
[<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
[<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
[<ffffffff811a0ac5>] new_slab+0x2c5/0x390
[<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
[<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
[<ffffffff811d8566>] alloc_inode+0x26/0xa0
[<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
[<ffffffff815b933e>] sock_alloc+0x1e/0x80
[<ffffffff815ba855>] __sock_create+0x95/0x220
[<ffffffff815baa04>] sock_create_kern+0x24/0x30
[<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
[<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.

Cc: stable@vger.kernel.org # 3.10+, needs backporting
Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov <wintchester@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>


# 2c935bc5 14-Nov-2016 Peter Zijlstra <peterz@infradead.org>

locking/atomic, kref: Add kref_read()

Since we need to change the implementation, stop exposing internals.

Provide kref_read() to read the current reference count; typically
used for debug messages.

Kills two anti-patterns:

atomic_read(&kref->refcount)
kref->refcount.counter

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 61ff6e9b 09-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph_tcp_sendpage(): use ITER_BVEC sendmsg

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 100803a8 14-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

ceph: switch to sock_recvmsg()

... and use ITER_BVEC instead of playing with kmap()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b3bbd3f2 02-Dec-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: no need to drop con->mutex for ->get_authorizer()

->get_authorizer(), ->verify_authorizer_reply(), ->sign_message() and
->check_message_signature() shouldn't be doing anything with or on the
connection (like closing it or sending messages).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 0dde5848 02-Dec-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: drop len argument of *verify_authorizer_reply()

The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply. Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 5c056fdc 02-Dec-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: verify authorize reply on connect

After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1bba ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down. Pass 0 to facilitate backporting, and kill
it in the next commit.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ee61e95 13-Mar-2016 Geliang Tang <geliangtang@163.com>

libceph: use KMEM_CACHE macro

Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 89f08173 20-Feb-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: use sizeof_footer() more

Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# dbc0d3ca 19-Feb-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: use the right footer size when skipping a message

ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.

Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# e7a88e82 17-Feb-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: don't bail early from try_read() when skipping a message

The contract between try_read() and try_write() is that when called
each processes as much data as possible. When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more. try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.

Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# f6330cc1 13-Jan-2016 Ilya Dryomov <idryomov@gmail.com>

libceph: clear messenger auth_retry flag if we fault

Commit 20e55c4cc758 ("libceph: clear messenger auth_retry flag when we
authenticate") got us only half way there. We clear the flag if the
second attempt succeeds, but it also needs to be cleared if that
attempt fails, to allow for the exponential backoff to kick in.
Otherwise, if ->should_authenticate() thinks our keys are valid, we
will busy loop, incrementing auth_retry to no avail:

process_connect ffff880079a63830 got BADAUTHORIZER attempt 1
process_connect ffff880079a63830 got BADAUTHORIZER attempt 2
process_connect ffff880079a63830 got BADAUTHORIZER attempt 3
process_connect ffff880079a63830 got BADAUTHORIZER attempt 4
process_connect ffff880079a63830 got BADAUTHORIZER attempt 5
...

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 67645d76 28-Dec-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: fix ceph_msg_revoke()

There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip. However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle. If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion. The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes. This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong. If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke. Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called. That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.

Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner <matt.conner@keepertech.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Matt Conner <matt.conner@keepertech.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 10bcee14 18-Dec-2015 Geliang Tang <geliangtang@163.com>

libceph: use list_for_each_entry_safe

Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
[idryomov@gmail.com: nuke call to list_splice_init() as well]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 17ddc49b 16-Nov-2015 Geliang Tang <geliangtang@163.com>

libceph: use list_next_entry instead of list_entry_next

list_next_entry has been defined in list.h, so I replace list_entry_next
with it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 583d0fef 02-Nov-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: clear msg->con in ceph_msg_release() only

The following bit in ceph_msg_revoke_incoming() is unsafe:

struct ceph_connection *con = msg->con;
if (!con)
return;
mutex_lock(&con->mutex);
<more msg->con use>

There is nothing preventing con from getting destroyed right after
msg->con test. One easy way to reproduce this is to disable message
signing only on the server side and try to map an image. The system
will go into a

libceph: read_partial_message ffff880073f0ab68 signature check failed
libceph: osd0 192.168.255.155:6801 bad crc/signature
libceph: read_partial_message ffff880073f0ab68 signature check failed
libceph: osd0 192.168.255.155:6801 bad crc/signature

loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are
likely to end up with a random GP fault if the reset handler executes
"within" ceph_msg_revoke_incoming():

<yet another reply w/o a signature>
...
<Ctrl-C>
rbd_obj_request_end
ceph_osdc_cancel_request
__unregister_request
ceph_osdc_put_request
ceph_msg_revoke_incoming
...
osd_reset
__kick_osd_requests
__reset_osd
remove_osd
ceph_con_close
reset_connection
<clear con->in_msg->con>
<put con ref>
put_osd
<free osd/con>
<msg->con use> <-- !!!

If ceph_msg_revoke_incoming() executes "before" the reset handler,
osd/con will be leaked because ceph_msg_revoke_incoming() clears
con->in_msg but doesn't put con ref, while reset_connection() only puts
con ref if con->in_msg != NULL.

The current msg->con scheme was introduced by commits 38941f8031bf
("libceph: have messages point to their connection") and 92ce034b5a74
("libceph: have messages take a connection reference"), which defined
when messages get associated with a connection and when that
association goes away. Part of the problem is that this association is
supposed to go away in much too many places; closing this race entirely
requires either a rework of the existing or an addition of a new layer
of synchronization.

In lieu of that, we can make it *much* less likely to hit by
disassociating messages only on their destruction and resend through
a different connection. This makes the code simpler and is probably
a good thing to do regardless - this patch adds a msg_con_set() helper
which is is called from only three places: ceph_con_send() and
ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear
it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a51983e4 28-Oct-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: add nocephx_sign_messages option

Support for message signing was merged into 3.19, along with
nocephx_require_signatures option. But, all that option does is allow
the kernel client to talk to clusters that don't support MSG_AUTH
feature bit. That's pretty useless, given that it's been supported
since bobtail.

Meanwhile, if one disables message signing on the server side with
"cephx sign messages = false", it becomes impossible to use the kernel
client since it expects messages to be signed if MSG_AUTH was
negotiated. Add nocephx_sign_messages option to support this use case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 859bff51 28-Oct-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: stop duplicating client fields in messenger

supported_features and required_features serve no purpose at all, while
nocrc and tcp_nodelay belong to ceph_options::flags.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 79dbd1ba 26-Oct-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: msg signing callouts don't need con argument

We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set. We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 343128ce 19-Oct-2015 Shraddha Barke <shraddha.6596@gmail.com>

libceph: use local variable cursor instead of &msg->cursor

Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data().

Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7f61f545 14-Sep-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: don't access invalid memory in keepalive2 path

This

struct ceph_timespec ceph_ts;
...
con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive(). As
a result, we send out bogus keepalive2 stamps. Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>


# d15f9d69 02-Sep-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: check data_len in ->alloc_msg()

Only ->alloc_msg() should check data_len of the incoming message
against the preallocated ceph_msg, doing it in the messenger is not
right. The contract is that either ->alloc_msg() returns a ceph_msg
which will fit all of the portions of the incoming message, or it
returns NULL and possibly sets skip, signaling whether NULL is due to
an -ENOMEM. ->alloc_msg() should be the only place where we make the
skip/no-skip decision.

I stumbled upon this while looking at con/osd ref counting. Right now,
if we get a non-extent message with a larger data portion than we are
prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
it in the messenger, we don't put the con/osd ref acquired in
ceph_con_in_msg_alloc() (which is normally put in process_message()),
so this also fixes a memory leak.

An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
corrupt random memory should a buggy ->alloc_msg() return an unfit
ceph_msg.

While at it, I changed the "unknown tid" dout() to a pr_warn() to make
sure all skips are seen and unified format strings.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# 8b9558aa 01-Sep-2015 Yan, Zheng <zyan@redhat.com>

libceph: use keepalive2 to verify the mon session is alive

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 68931622 03-Jul-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: rename con_work() to ceph_con_workfn()

Even though it's static, con_work(), being a work func, shows up in
various stacktraces a lot. Prefix it with ceph_.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d920ff6f 25-Jun-2015 Benoît Canet <benoit.canet@nodalink.com>

libceph: Avoid holding the zero page on ceph_msgr_slab_init errors

ceph_msgr_slab_init may fail due to a temporary ENOMEM.

Delay a bit the initialization of zero_page in ceph_msgr_init and
reorder its cleanup in _ceph_msgr_exit so it's done in reverse
order of setup.

BUG_ON() will not suffer to be postponed in case it is triggered.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c44bd69c 09-Jul-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: treat sockaddr_storage with uninitialized family as blank

addr_is_blank() should return true if family is neither AF_INET nor
AF_INET6. This is what its counterpart entity_addr_t::is_blank_ip() is
doing and it is the right thing to do: in process_banner() we check if
our address is blank and if it is "learn" it from our peer. As it is,
we never learn our address and always send out a blank one. This goes
way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
some ipv6 support groundwork") from 2009.

While at at, do not open-code ipv6_addr_any() and use INADDR_ANY
constant instead of 0.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 757856d2 25-Jun-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: enable ceph in a non-default network namespace

Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net. Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.

This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# c2cfa194 25-Jun-2015 Benoît Canet <benoit.canet@nodalink.com>

libceph: Fix ceph_tcp_sendpage()'s more boolean usage

From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h:

bool last_piece; /* current is last piece */

In ceph_msg_data_next():

*last_piece = cursor->last_piece;

A call to ceph_msg_data_next() is followed by:

ret = ceph_tcp_sendpage(con->sock, page, page_offset,
length, last_piece);

while ceph_tcp_sendpage() is:

static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
int offset, size_t size, bool more)

The logic is inverted: correct it.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6ba8edc0 24-Jun-2015 Benoît Canet <benoit.canet@nodalink.com>

libceph: Remove spurious kunmap() of the zero page

ceph_tcp_sendpage already does the work of mapping/unmapping
the zero page if needed.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# eeb1bd5c 08-May-2015 Eric W. Biederman <ebiederm@xmission.com>

net: Add a struct net parameter to sock_create_kern

This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67c64eb7 23-Mar-2015 Ilya Dryomov <idryomov@gmail.com>

libceph: don't overwrite specific con error msgs

- specific con->error_msg messages (e.g. "protocol version mismatch")
end up getting overwritten by a catch-all "socket error on read
/ write", introduced in commit 3a140a0d5c4b ("libceph: report socket
read/write error message")
- "bad message sequence # for incoming message" loses to "bad crc" due
to the fact that -EBADMSG is used for both

Fix it, and tidy up con->error_msg assignments and pr_errs while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6d7fdb0a 02-Apr-2015 Ilya Dryomov <idryomov@gmail.com>

Revert "libceph: use memalloc flags for net IO"

This reverts commit 89baaa570ab0b476db09408d209578cfed700e9f.

Dirty page throttling should be sufficient for us in the general case
so there is no need to use __GFP_MEMALLOC - it would be needed only in
the swap-over-rbd case, which we currently don't support. (It would
probably take approximately the commit that is being reverted to add
that support, but we would also need the "swap" option to distinguish
from the general case and make sure swap ceph_client-s aren't shared
with anything else.) See ceph-devel threads [1] and [2] for the
details of why enabling pfmemalloc reserves for all cases is a bad
thing.

On top of potential system lockups related to drained emergency
reserves, this turned out to cause ceph lockups in case peers are on
the same host and communicating via loopback due to sk_filter()
dropping pfmemalloc skbs on the receiving side because the receiving
loopback socket is not tagged with SOCK_MEMALLOC.

[1] "SOCK_MEMALLOC vs loopback"
http://www.spinics.net/lists/ceph-devel/msg22998.html
[2] "[PATCH] libceph: don't set memalloc flags in loopback case"
http://www.spinics.net/lists/ceph-devel/msg23392.html

Conflicts:
net/ceph/messenger.c [ context: tcp_nodelay option ]

Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sage Weil <sage@redhat.com>
Cc: stable@vger.kernel.org # 3.18+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mel Gorman <mgorman@suse.de>


# ba988f87 23-Jan-2015 Chaitanya Huilgol <chaitanya.huilgol@gmail.com>

libceph: tcp_nodelay support

TCP_NODELAY socket option set on connection sockets,
disables Nagle’s algorithm and improves latency characteristics.
tcp_nodelay(default)/notcp_nodelay option flags provided to
enable/disable setting the socket option.

Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com>
[idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>


# 33d07337 04-Nov-2014 Yan, Zheng <zyan@redhat.com>

libceph: message signature support

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 4965fc38 23-Oct-2014 Ilya Dryomov <idryomov@redhat.com>

libceph: nuke ceph_kvfree()

Use kvfree() from linux/mm.h instead, which is identical. Also fix the
ceph_buffer comment: we will allocate with kmalloc() up to 32k - the
value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an
implementation detail so don't mention it at all.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>


# 89baaa57 16-Oct-2014 Mike Christie <michaelc@cs.wisc.edu>

libceph: use memalloc flags for net IO

This patch has ceph's lib code use the memalloc flags.

If the VM layer needs to write data out to free up memory to handle new
allocation requests, the block layer must be able to make forward progress.
To handle that requirement we use structs like mempools to reserve memory for
objects like bios and requests.

The problem is when we send/receive block layer requests over the network
layer, net skb allocations can fail and the system can lock up.
To solve this, the memalloc related flags were added. NBD, iSCSI
and NFS uses these flags to tell the network/vm layer that it should
use memory reserves to fullfill allcation requests for structs like
skbs.

I am running ceph in a bunch of VMs in my laptop, so this patch was
not tested very harshly.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>


# f9865f06 10-Oct-2014 Ilya Dryomov <idryomov@redhat.com>

libceph: ceph-msgr workqueue needs a resque worker

Commit f363e45fd118 ("net/ceph: make ceph_msgr_wq non-reentrant")
effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is
wrong - libceph is very much a memory reclaim path, so restore it.

Cc: stable@vger.kernel.org # needs backporting for < 3.12
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>


# e4339d28 16-Sep-2014 Yan, Zheng <zyan@redhat.com>

libceph: reference counting pagelist

this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# b9a67899 09-Sep-2014 Joe Perches <joe@perches.com>

libceph: Convert pr_warning to pr_warn

Use the more common pr_warn.

Other miscellanea:

o Coalesce formats
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>


# 5f740d7e 07-Aug-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly

Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is <=
PAGE_SIZE, which results in invalid length passed to
ceph_tcp_{send,recv}page() and causes various asserts to fire.

# cat pages-cursor-init.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
# rbd_resize calls librbd rbd_resize(), size is in bytes
./rbd_resize bar $(((4 << 20) + 512))
rbd resize --size 10 bar
BAR_DEV=$(rbd map bar)
# trigger a 512-byte copyup -- 512-byte page array data item
dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing. The size_t cast is
unnecessary.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# 37ab77ac 24-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: drop osd ref when canceling con work

queue_con() bumps osd ref count. We should do the reverse when
canceling con work.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# 0215e44b 20-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: move and add dout()s to ceph_msg_{get,put}()

Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn
kref release callback into a static function.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# 178eda29 22-Apr-2014 Chunwei Chen <tuxoko@gmail.com>

libceph: fix corruption when using page_count 0 page in rbd

It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:

https://github.com/zfsonlinux/spl/issues/241
http://tracker.ceph.com/issues/7790

The reason is that ZFS will send pages with page_count 0 into rbd, which in
turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.

This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix address this issue by fall back
to slower sendmsg when page_count 0 detected.

Cc: Sage Weil <sage@inktank.com>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>


# 676d2369 11-Apr-2014 David S. Miller <davem@davemloft.net>

net: Fix use after free by removing length arg from sk_data_ready callbacks.

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>


# d90deda6 22-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com>

libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 0ec1d15e 05-Feb-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: do not dereference a NULL bio pointer

Commit f38a5181d9f3 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1. Fix it.

Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# eeb0bed5 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: add ceph_kv{malloc,free}() and switch to them

Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set. This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# f2be82b0 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: fix preallocation check in get_reply()

The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer. Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced
reboots. Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 3cea4c30 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: rename ceph_msg::front_max to front_alloc_len

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# f48db1e9 30-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: use CEPH_MON_PORT when the specified port is 0

Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 2b3e0c90 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: update ceph_features.h

This updates ceph_features.h so that it has all feature bits defined in
ceph.git. In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly. The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 12b4629a 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: all features fields must be u64

In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# f38a5181 07-Aug-2013 Kent Overstreet <kmo@daterainc.com>

ceph: Convert to immutable biovecs

Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org


# 4d1829a5 30-Jul-2013 Tejun Heo <tj@kernel.org>

ceph: WQ_NON_REENTRANT is meaningless and going away

dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org


# 64dc6130 22-Jul-2013 Eric Dumazet <edumazet@google.com>

net: add sk_stream_is_writeable() helper

Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 81b36be4 30-Apr-2013 Alex Elder <elder@inktank.com>

libceph: allocate ceph message data with a slab allocator

Create a slab cache to manage ceph_msg_data structure allocation.

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# e3d5d638 30-Apr-2013 Alex Elder <elder@inktank.com>

libceph: allocate ceph messages with a slab allocator

Create a slab cache to manage ceph_msg structure allocation.

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# a51b272e 19-Apr-2013 Alex Elder <elder@inktank.com>

libceph: fix two messenger bugs

This patch makes four small changes in the ceph messenger.

While getting copyup functionality working I found two bugs in the
messenger. Existing paths through the code did not trigger these
problems, but they're fixed here:
- In ceph_msg_data_pagelist_cursor_init(), the cursor's
last_piece field was being checked against the length
supplied. This was OK until this commit: ccba6d98 libceph:
implement multiple data items in a message That commit changed
the cursor init routines to allow lengths to be supplied that
exceeded the size of the current data item. Because of this,
we have to use the assigned cursor resid field rather than the
provided length in determining whether the cursor points to
the last piece of a data item.
- In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
catching attempts to add page data to a message if the message
already had data assigned to it. That was OK until that same
commit, at which point it was fine for messages to have
multiple data items. It slipped through because that BUG_ON()
call was present twice in that function. (You can never be too
careful.)

In addition two other minor things are changed:
- In ceph_msg_data_cursor_init(), the local variable "data" was
getting assigned twice.
- In ceph_msg_data_advance(), it was assumed that the
type-specific advance routine would set new_piece to true
after it advanced past the last piece. That may have been
fine, but since we check for that case we might as well set it
explicitly in ceph_msg_data_advance().

This resolves:
http://tracker.ceph.com/issues/4762

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 90af3602 05-Apr-2013 Alex Elder <elder@inktank.com>

libceph: add, don't set data for a message

Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing. Currently only one data item is
ever added to a message, but that's about to change.

This resolves:
http://tracker.ceph.com/issues/2770

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# ca8b3a69 05-Apr-2013 Alex Elder <elder@inktank.com>

libceph: implement multiple data items in a message

This patch adds support to the messenger for more than one data item
in its data list.

A message data cursor has two more fields to support this:
- a count of the number of bytes left to be consumed across
all data items in the list, "total_resid"
- a pointer to the head of the list (for validation only)

The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.

When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided. The data pointer
is initialized to the first data item on the list. From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.

Advancing the cursor consumes bytes in total_resid. If the resid
field reaches zero, it means the current data item is fully
consumed. If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that. (A check is made at
this point to make sure we don't wrap around the front of the list.)

The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support. The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.

When total_resid reaches zero, we're done.

This resolves:
http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 5240d9f9 14-Mar-2013 Alex Elder <elder@inktank.com>

libceph: replace message data pointer with list

In place of the message data pointer, use a list head which links
through message data items. For now we only support a single entry
on that list.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 8ae4f4f5 14-Mar-2013 Alex Elder <elder@inktank.com>

libceph: have cursor point to data

Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item. This will
allow a message cursor to be used for more than one data item.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 36153ec9 14-Mar-2013 Alex Elder <elder@inktank.com>

libceph: move cursor into message

A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.

Move the cursor embedded in the message data structure into the
message itself. To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.

Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.

This and the upcoming patches are part of:
http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# c851c495 05-Apr-2013 Alex Elder <elder@inktank.com>

libceph: record bio length

The bio is the only data item type that doesn't record its full
length. Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# f759ebb9 05-Apr-2013 Alex Elder <elder@inktank.com>

libceph: skip message if too big to receive

We know the length of our message buffers. If we get a message
that's too long, just dump it and ignore it. If skip was set
then con->in_msg won't be valid, so be careful not to dereference
a null pointer in the process.

This resolves:
http://tracker.ceph.com/issues/4664

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# ea96571f 05-Apr-2013 Alex Elder <elder@inktank.com>

libceph: fix possible CONFIG_BLOCK build problem

This patch:
15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have. Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 98fa5dd8 01-Apr-2013 Alex Elder <elder@inktank.com>

libceph: provide data length when preparing message

In prepare_message_data(), the length used to initialize the cursor
is taken from the header of the message provided. I'm working
toward not using the header data length field to determine length in
outbound messages, and this is a step in that direction. For
inbound messages this will be set to be the actual number of bytes
that are arriving (which may be less than the total size of the data
buffer available).

This resolves:
http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# a1930804 14-Mar-2013 Alex Elder <elder@inktank.com>

libceph: record message data length

Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure. This information has
been maintained in wire byte order in the message header, but
that's going to change soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 56fc5659 30-Mar-2013 Alex Elder <elder@inktank.com>

libceph: account for alignment in pages cursor

When a cursor for a page array data message is initialized it needs
to determine the initial value for cursor->last_piece. Currently it
just checks if length is less than a page, but that's not correct.
The data in the first page in the array will be offset by a page
offset based on the alignment recorded for the data. (All pages
thereafter will be aligned at the base of the page, so there's
no need to account for this except for the first page.)

Because this was wrong, there was a case where the length of a piece
would be calculated as all of the residual bytes in the message and
that plus the page offset could exceed the length of a page.

So fix this case. Make sure the sum won't wrap.

This resolves a third issue described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 5df521b1 30-Mar-2013 Alex Elder <elder@inktank.com>

libceph: page offset must be less than page size

Currently ceph_msg_data_pages_advance() allows the page offset value
to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will
treat it as 0. But that doesn't happen, and the result led to a
helpful assertion failure.

Change ceph_msg_data_pages_advance() to truncate the offset to 0
before returning if it reaches PAGE_SIZE.

Make a few other minor adjustments in this area (comments and a
better assertion) while modifying it.

This resolves a second issue described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 1190bf06 30-Mar-2013 Alex Elder <elder@inktank.com>

libceph: fix broken data length assertions

It's OK for the result of a read to come back with fewer bytes than
were requested. So don't trigger a BUG() in that case when
initializing the data cursor.

This resolves the first problem described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 6644ed7b 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: make message data be a pointer

Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.

A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.

Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.

When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.

This partially resolves:
http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 8ea299bc 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use only ceph_msg_data_advance()

The *_msg_pos_next() functions do little more than call
ceph_msg_data_advance(). Replace those wrapper functions with
a simple call to ceph_msg_data_advance().

This cleanup is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 143334ff 29-Mar-2013 Alex Elder <elder@inktank.com>

libceph: don't add to crc unless data sent

In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered. Because it was computed *before* sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result. This would occasionally happen in
the evnet of a connection failure.

The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.

So don't calculate the crc for a piece until *after* we know at
least one byte of it has been sent. That will avoid this problem.

This resolves:
http://tracker.ceph.com/issues/4450

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# f5db90bc 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill last of ceph_msg_pos

The only remaining field in the ceph_msg_pos structure is
did_page_crc. In the new cursor model of things that flag (or
something like it) belongs in the cursor.

Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.

In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed. Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.

In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times. In read_partial_msg_data() we do something similar
(but mainly for consistency there).

With that, the ceph_msg_pos structure can go away, and it no longer
needs to be passed as an argument to prepare_message_data().

This cleanup is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 859a35d5 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill most of ceph_msg_pos

All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them. This allows
several small blocks of code to go away.

This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 643c68a4 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use cursor resid for loop condition

Use the "resid" field of a cursor rather than finding when the
message data position has moved up to meet the data length to
determine when all data has been sent or received in
write_partial_message_data() and read_partial_msg_data().

This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 4c59b4a2 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: collapse all data items into one

It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
- A page array is used by the osd client (on behalf of the file
system) and by rbd. Only one osd op (and therefore at most
one data item) is ever used at a time by rbd. And the only
time the file system sends two, the second op contains no
data.
- A bio is only used by the rbd client (and again, only one
data item per message)
- A page list is used by the file system and by rbd for outgoing
data, but only one op (and one data item) at a time.

We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.

This allows us to eliminate quite a bit of duplicated code.

This is related to:
http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 686be208 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: get rid of read helpers

Now that read_partial_message_pages() and read_partial_message_bio()
are literally identical functions we can factor them out. They're
pretty simple as well, so just move their relevant content into
read_partial_msg_data().

This is and previous patches together resolve:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 61fcdc97 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: no outbound zero data

There is handling in write_partial_message_data() for the case where
only the length of--and no other information about--the data to be
sent has been specified. It uses the zero page as the source of
data to send in this case.

This case doesn't occur. All message senders set up a page array,
pagelist, or bio describing the data to be sent. So eliminate the
block of code that handles this (but check and issue a warning for
now, just in case it happens for some reason).

This resolves:
http://tracker.ceph.com/issues/4426

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 878efabd 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use cursor for inbound data pages

The cursor code for a page array selects the right page, page
offset, and length to use for a ceph_tcp_recvpage() call, so
we can use it to replace a block in read_partial_message_pages().

This partially resolves:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 6518be47 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill ceph message bio_iter, bio_seg

The bio_iter and bio_seg fields in a message are no longer used, we
use the cursor instead. So get rid of them and the functions that
operate on them them.

This is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 463207aa 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use cursor for bio reads

Replace the use of the information in con->in_msg_pos for incoming
bio data. The old in_msg_pos and the new cursor mechanism do
basically the same thing, just slightly differently.

The main functional difference is that in_msg_pos keeps track of the
length of the complete bio list, and assumed it was fully consumed
when that many bytes had been transferred. The cursor does not assume
a length, it simply consumes all bytes in the bio list. Because the
only user of bio data is the rbd client, and because the length of a
bio list provided by rbd client always matches the number of bytes
in the list, both ways of tracking length are equivalent.

In addition, for in_msg_pos the initial bio vector is selected as
the initial value of the bio->bi_idx, while the cursor assumes this
is zero. Again, the rbd client always passes 0 as the initial index
so the effect is the same.

Other than that, they basically match:
in_msg_pos cursor
---------- ------
bio_iter bio
bio_seg vec_index
page_pos page_offset

The in_msg_pos field is initialized by a call to init_bio_iter().
The bio cursor is initialized by ceph_msg_data_cursor_init().
Both now happen in the same spot, in prepare_message_data().

The in_msg_pos field is advanced by a call to in_msg_pos_next(),
which updates page_pos and calls iter_bio_next() to move to the next
bio vector, or to the next bio in the list. The cursor is advanced
by ceph_msg_data_advance(). That isn't currently happening so
add a call to that in in_msg_pos_next().

Finally, the next piece of data to use for a read is determined
by a bunch of lines in read_partial_message_bio(). Those can be
replaced by an equivalent ceph_msg_data_bio_next() call.

This partially resolves:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 25aff7c5 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: record residual bytes for all message data types

All of the data types can use this, not just the page array. Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list. Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to intiialize
the "resid" field of the cursor.

In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request. This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.

This resolves:
http://tracker.ceph.com/issues/4427

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 28a89dde 11-Mar-2013 Alex Elder <elder@inktank.com>

libceph: drop pages parameter

The value passed for "pages" in read_partial_message_pages() is
always the pages pointer from the incoming message, which can be
derived inside that function. So just get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 888334f9 25-Mar-2013 Alex Elder <elder@inktank.com>

libceph: initialize data fields on last msg put

When the last reference to a ceph message is dropped,
ceph_msg_last_put() is called to clean things up.

For "normal" messages (allocated via ceph_msg_new() rather than
being allocated from a memory pool) it's sufficient to just release
resources. But for a mempool-allocated message we actually have to
re-initialize the data fields in the message back to initial state
so they're ready to go in the event the message gets reused.

Some of this was already done; this fleshes it out so it's done
more completely.

This resolves:
http://tracker.ceph.com/issues/4540

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 20e55c4c 25-Mar-2013 Sage Weil <sage@inktank.com>

libceph: clear messenger auth_retry flag when we authenticate

We maintain a counter of failed auth attempts to allow us to retry once
before failing. However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).

This is one part of the sorry sequence of events in bug

http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 3a23083b 25-Mar-2013 Sage Weil <sage@inktank.com>

libceph: implement RECONNECT_SEQ feature

This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake. This
avoids sending a bunch of useless data over the socket.

It has been supported in the server code since v0.22 (Sep 2010).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 8a166d05 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: more cleanup of write_partial_msg_pages()

Basically all cases in write_partial_msg_pages() use the cursor, and
as a result we can simplify that function quite a bit.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 9d2a06c2 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill message trail

The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.

The result allows write_partial_msg_pages() to be simplified
considerably.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# e766d7b5 07-Mar-2013 Alex Elder <elder@inktank.com>

libceph: implement pages array cursor

Implement and use cursor routines for page array message data items
for outbound message data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 6aaa4511 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: implement bio message data item cursor

Implement and use cursor routines for bio message data items for
outbound message data.

(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 7fe1e5e5 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use data cursor for message pagelist

Switch to using the message cursor for the (non-trail) outgoing
pagelist data item in a message if present.

Notes on the logic changes in out_msg_pos_next():
- only the mds client uses a ceph pagelist for message data;
- if the mds client ever uses a pagelist, it never uses a page
array (or anything else, for that matter) for data in the same
message;
- only the osd client uses the trail portion of a message data,
and when it does, it never uses any other data fields for
outgoing data in the same message; and finally
- only the rbd client uses bio message data (never pagelist).

Therefore out_msg_pos_next() can assume:
- if we're in the trail portion of a message, the message data
pagelist, data, and bio can be ignored; and
- if there is a page list, there will never be any a bio or page
array data, and vice-versa.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# dd236fcb 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: prepare for other message data item types

This just inserts some infrastructure in preparation for handling
other types of ceph message data items. No functional changes,
just trying to simplify review by separating out some noise.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# fe38a2b6 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: start defining message data cursor

This patch lays out the foundation for using generic routines to
manage processing items of message data.

For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.

First some basic concepts. We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message. There are currently four of those, with single-letter
field names p, l, b, and t. A data item is further broken into
"pieces" which always lie in a single page. A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.

We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine. The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item. The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item. It also indicates
to the caller whether the piece being returned is the last one in
the data item.

The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor). This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code. It returns an indication of whether the result
means the current piece has been fully consumed. This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.

The trail of a message is implemented as a ceph pagelist. The
routines defined for it will be usable for non-trail pagelist data
as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 43794509 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: abstract message data

Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents. Use this to represent the pages,
pagelist, bio, and trail in a ceph message.

Verify message data is of type NONE in ceph_msg_data_set_*()
routines. Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# f9e15777 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: be explicit about message data representation

A ceph message has a data payload portion. The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways. The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.

Specifically, the data in a message can be in:
- an array of pages
- a list of pages
- a list of Linux bios
- a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)

Impose more structure on the ceph message, making the grouping of
some of these fields explicit. Shorten the name of the
"page_alignment" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 97fb1c7f 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define ceph_msg_has_*() data macros

Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 35b62808 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define and use ceph_crc32c_page()

Factor out a common block of code that updates a CRC calculation
over a range of data in a page.

This and the preceding patches are related to:
http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# afb3d90e 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define and use ceph_tcp_recvpage()

Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().

Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 34d2d200 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: encapsulate reading message data

Pull the code that reads the data portion into a message into
a separate function read_partial_msg_data().

Rename write_partial_msg_pages() to be write_partial_message_data()
to match its read counterpart, and to reflect its more generic
purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# e387d525 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: small write_partial_msg_pages() refactor

Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 78625051 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: consolidate message prep code

In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out. Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.

Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots. The
did_page_crc is not used for input (so it's harmless to initialize
it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# bae6acd9 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: use local variables for message positions

There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function. Use a local pointer
variable for this purpose to unclutter the code.

This and the upcoming cleanup patches are related to:
http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 98a03708 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: don't clear bio_iter in prepare_write_message()

At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().

That no longer seems to be the case. Here's why.

The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.

In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.

Because it gets overwritten anyway, there is no need to set it to
NULL. So don't do it.

This resolves:
http://tracker.ceph.com/issues/4402

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 07aa1558 04-Mar-2013 Alex Elder <elder@inktank.com>

libceph: activate message data assignment checks

The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.

This resolves both of these:
http://tracker.ceph.com/issues/4263
http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 4a73ef27 07-Mar-2013 Alex Elder <elder@inktank.com>

libceph: record message data byte length

Record the number of bytes of data in a page array rather than the
number of pages in the array. It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# ebf18f47 04-Mar-2013 Alex Elder <elder@inktank.com>

ceph: only set message data pointers if non-empty

Change it so we only assign outgoing data information for messages
if there is outgoing data to send.

This then allows us to add a few more (currently commented-out)
assertions.

This is related to:
http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 27fa8385 13-Feb-2013 Alex Elder <elder@inktank.com>

libceph: isolate other message data fields

Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.

This partially resolves:
http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# f1baeb2b 07-Mar-2013 Alex Elder <elder@inktank.com>

libceph: set page info with byte length

When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 02afca6c 13-Feb-2013 Alex Elder <elder@inktank.com>

libceph: isolate message page field manipulation

Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.

Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)

Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.

This partially resolves:
http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 9516e45b 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: simplify new message initialization

Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 35c7bfbc 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: advance pagelist with list_rotate_left()

While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.

There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.

So replace these two list_move_tail() calls with list_rotate_left().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# e788182f 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define and use in_msg_pos_next()

Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().

Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.

Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.

If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.

In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# b3d56fab 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill args in read_partial_message_bio()

There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.

So drop those two parameters and use their derived equivalents.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# e1dcb128 06-Mar-2013 Alex Elder <elder@inktank.com>

libceph: change type of ceph_tcp_sendpage() "more"

Change the type of the "more" parameter from int to bool.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 6ebc8b32 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: minor byte order problems in read_partial_message()

Some values printed are not (necessarily) in CPU order. We already
have a copy of the converted versions, so use them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 7b11ba37 08-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define CEPH_MSG_MAX_MIDDLE_LEN

This is probably unnecessary but the code read as if it were wrong
in read_partial_message().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 4137577a 05-Mar-2013 Alex Elder <elder@inktank.com>

libceph: clean up skipped message logic

In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().

The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.

The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.

Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.

This resolves:
http://tracker.ceph.com/issues/4324

Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 53ded495 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: define mds_alloc_msg() method

The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.

This and the next patch resolve:
http://tracker.ceph.com/issues/4322

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 1d866d1c 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: drop mutex while allocating a message

In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().

Drop the mutex before making this call, and make sure we're still
connected when we get it back again.

This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# ec02a2f2 01-Mar-2013 Alex Elder <elder@inktank.com>

libceph: kill ceph_msg->pagelist_count

The pagelist_count field is never actually used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# d4b515fa 25-Feb-2013 Alex Elder <elder@inktank.com>

libceph: distinguish page array and pagelist count

Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 07c09b72 15-Feb-2013 Alex Elder <elder@inktank.com>

libceph: make ceph_msg->bio_seg be unsigned

The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)

Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 49659416 18-Feb-2013 Alex Elder <elder@inktank.com>

libceph: use a do..while loop in con_work()

This just converts a manually-implemented loop into a do..while loop
in con_work(). It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.

Also update a few dout() calls near the affected code for
consistency.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# b6e7b6a1 18-Feb-2013 Alex Elder <elder@inktank.com>

libceph: use a flag to indicate a fault has occurred

This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred. This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 93209264 18-Feb-2013 Alex Elder <elder@inktank.com>

libceph: separate non-locked fault handling

An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset. The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.

Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish(). Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault(). Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.

The motivation for this was a warning from sparse about the locking
being done here. Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.

This partially resolves:
http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# f20a39fd 18-Feb-2013 Alex Elder <elder@inktank.com>

libceph: encapsulate connection backoff

Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().

Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 15417167 18-Feb-2013 Alex Elder <elder@inktank.com>

libceph: eliminate sparse warnings

Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
- Convert functions that are never referenced externally to have
static scope.
- Pass NULL rather than 0 for a pointer argument in one spot in
ceph_monc_delete_snapid()

This partially resolves:
http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# c9ffc77a 20-Feb-2013 Alex Elder <elder@inktank.com>

libceph: define connection flag helpers

Define and use functions that encapsulate operations performed on
a connection's flags.

This resolves:
http://tracker.ceph.com/issues/4234

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 3ebc21f7 31-Jan-2013 Alex Elder <elder@inktank.com>

libceph: fix messenger CONFIG_BLOCK dependencies

The ceph messenger has a few spots that are only used when
bio messages are supported, and that's only when CONFIG_BLOCK
is defined. This surrounds a couple of spots with #ifdef's
that would cause a problem if CONFIG_BLOCK were not present
in the kernel configuration.

This resolves:
http://tracker.ceph.com/issues/3976

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 0fa6ebc6 27-Dec-2012 Sage Weil <sage@inktank.com>

libceph: fix protocol feature mismatch failure path

We should not set con->state to CLOSED here; that happens in
ceph_fault() in the caller, where it first asserts that the state
is not yet CLOSED. Avoids a BUG when the features don't match.

Since the fail_protocol() has become a trivial wrapper, replace
calls to it with direct calls to reset_connection().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 122070a2 26-Dec-2012 Alex Elder <elder@inktank.com>

libceph: WARN, don't BUG on unexpected connection states

A number of assertions in the ceph messenger are implemented with
BUG_ON(), killing the system if connection's state doesn't match
what's expected. At this point our state model is (evidently) not
well understood enough for these assertions to trigger a BUG().
Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
so we learn about these issues without killing the machine.

We now recognize that a connection fault can occur due to a socket
closure at any time, regardless of the state of the connection. So
there is really nothing we can assert about the state of the
connection at that point so eliminate that assertion.

Reported-by: Ugis <ugis22@gmail.com>
Tested-by: Ugis <ugis22@gmail.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 28362986 14-Dec-2012 Alex Elder <elder@inktank.com>

libceph: report connection fault with warning

When a connection's socket disconnects, or if there's a protocol
error of some kind on the connection, a fault is signaled and
the connection is reset (closed and reopened, basically). We
currently get an error message on the log whenever this occurs.

A ceph connection will attempt to reestablish a socket connection
repeatedly if a fault occurs. This means that these error messages
will get repeatedly added to the log, which is undesirable.

Change the error message to be a warning, so they don't get
logged by default.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 7bb21d68 07-Dec-2012 Alex Elder <elder@inktank.com>

libceph: socket can close in any connection state

A connection's socket can close for any reason, independent of the
state of the connection (and without irrespective of the connection
mutex). As a result, the connectino can be in pretty much any state
at the time its socket is closed.

Handle those other cases at the top of con_work(). Pull this whole
block of code into a separate function to reduce the clutter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 7246240c 25-Oct-2012 Sage Weil <sage@inktank.com>

libceph: avoid NULL kref_put from NULL alloc_msg return

The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
may return NULL. It also drops con->mutex while it allocates a message,
which means that the connection state may change (e.g., get closed). If
that happens, we clean up and bail out. Avoid calling ceph_msg_put() on
a NULL return value and triggering a crash.

This was observed when an ->alloc_msg() call races with a timeout that
resends a zillion messages and resets the connection, and ->alloc_msg()
returns NULL (because the request was resent to another target).

Fixes http://tracker.newdream.net/issues/3342

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 9bd95261 24-Oct-2012 Sage Weil <sage@inktank.com>

libceph: avoid NULL kref_put when osd reset races with alloc_msg

The ceph_on_in_msg_alloc() method drops con->mutex while it allocates a
message. If that races with a timeout that resends a zillion messages and
resets the connection, and the ->alloc_msg() method returns a NULL message,
it will call ceph_msg_put(NULL) and BUG.

Fix by only calling put if msg is non-NULL.

Fixes http://tracker.newdream.net/issues/3142

Signed-off-by: Sage Weil <sage@inktank.com>


# 802c6d96 08-Oct-2012 Alex Elder <elder@inktank.com>

rbd: define common queue_con_delay()

This patch defines a single function, queue_con_delay() to call
queue_delayed_work() for a connection. It basically generalizes
what was previously queue_con() by adding the delay argument.
queue_con() is now a simple helper that passes 0 for its delay.
queue_con_delay() returns 0 if it queued work or an errno if it
did not for some reason.

If con_work() finds the BACKOFF flag set for a connection, it now
calls queue_con_delay() to handle arranging to start again after a
delay.

Note about connection reference counts: con_work() only ever gets
called as a work item function. At the time that work is scheduled,
a reference to the connection is acquired, and the corresponding
con_work() call is then responsible for dropping that reference
before it returns.

Previously, the backoff handling inside con_work() silently handed
off its reference to delayed work it scheduled. Now that
queue_con_delay() is used, a new reference is acquired for the
newly-scheduled work, and the original reference is dropped by the
con->ops->put() call at the end of the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 8618e30b 08-Oct-2012 Alex Elder <elder@inktank.com>

rbd: let con_work() handle backoff

Both ceph_fault() and con_work() include handling for imposing a
delay before doing further processing on a faulted connection.
The latter is used only if ceph_fault() is unable to.

Instead, just let con_work() always be responsible for implementing
the delay. After setting up the delay value, set the BACKOFF flag
on the connection unconditionally and call queue_con() to ensure
con_work() will get called to handle it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 588377d6 08-Oct-2012 Alex Elder <elder@inktank.com>

rbd: reset BACKOFF if unable to re-queue

If ceph_fault() is unable to queue work after a delay, it sets the
BACKOFF connection flag so con_work() will attempt to do so.

In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
result in newly-queued work, it simply ignores this condition and
proceeds as if no backoff delay were desired. There are two
problems with this--one of which is a bug.

The first problem is simply that the intended behavior is to back
off, and if we aren't able queue the work item to run after a delay
we're not doing that.

The only reason queue_delayed_work() won't queue work is if the
provided work item is already queued. In the messenger, this
means that con_work() is already scheduled to be run again. So
if we simply set the BACKOFF flag again when this occurs, we know
the next con_work() call will again attempt to hold off activity
on the connection until after the delay.

The second problem--the bug--is a leak of a reference count. If
queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
the connection reference held on entry to con_work(). However,
processing is (was) allowed to continue, and at the end of the
function a second con->ops->put() is called.

This patch fixes both problems.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 5ce765a5 21-Sep-2012 Alex Elder <elder@inktank.com>

libceph: only kunmap kmapped pages

In write_partial_msg_pages(), pages need to be kmapped in order to
perform a CRC-32c calculation on them. As an artifact of the way
this code used to be structured, the kunmap() call was separated
from the kmap() call and both were done conditionally. But the
conditions under which the kmap() and kunmap() calls were made
differed, so there was a chance a kunmap() call would be done on a
page that had not been mapped.

The symptom of this was tripping a BUG() in kunmap_high() when
pkmap_count[nr] became 0.

Reported-by: Bryan K. Wright <bryan@virginia.edu>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 6d4221b5 10-Aug-2012 Jim Schutt <jaschut@sandia.gov>

libceph: avoid truncation due to racing banners

Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.

When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work(). During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().

If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner. In this case Ceph's protocol negotiation
can complete succesfully.

The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client. As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read(). If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner. Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.

The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".

Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.

[elder@inktak.com: added note about server-side behavior]

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>


# 61399191 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: recheck con state after allocating incoming message

We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem. If the state does change, -EAGAIN
will get us back to con_work() and loop.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 4740a623 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: change ceph_con_in_msg_alloc convention to be less weird

This function's calling convention is very limiting. In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).

Instead, return an normal 0 or error code, and make the skip a pointer
output parameter. Drop the useless in_hdr argument (we have the con
pointer).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 8636ea67 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: avoid dropping con mutex before fault

The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it. This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).

Add annotation so that lockdep realizes we will drop the mutex before
returning.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 7b862e07 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: verify state after retaking con lock after dispatch

We drop the con mutex when delivering a message. When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 8007b8d6 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: fix handling of immediate socket connect failure

If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 43c7427d 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: clear all flags on con_close

Signed-off-by: Sage Weil <sage@inktank.com>


# 4a861692 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: clean up con flags

Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.

Signed-off-by: Sage Weil <sage@inktank.com>


# 8dacc7da 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: replace connection state bits with states

Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits. All of the con->state checks are
now under the protection of the con mutex, so this is safe. It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.

This appears to hold up well to stress testing both with and without socket
failure injection on the server side.

Signed-off-by: Sage Weil <sage@inktank.com>


# d7353dd5 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: drop unnecessary CLOSED check in socket state change callback

If we are CLOSED, the socket is closed and we won't get these.

Signed-off-by: Sage Weil <sage@inktank.com>


# ee76e073 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: close socket directly from ceph_con_close()

It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.

Signed-off-by: Sage Weil <sage@inktank.com>


# 2e8cb100 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: drop gratuitous socket close calls in con_work

If the state is CLOSED or OPENING, we shouldn't have a socket.

Signed-off-by: Sage Weil <sage@inktank.com>


# a59b55a6 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: move ceph_con_send() closed check under the con mutex

Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.

Signed-off-by: Sage Weil <sage@inktank.com>


# 00650931 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: move msgr clear_standby under con mutex protection

Avoid dropping and retaking con->mutex in the ceph_con_send() case by
leaving locking up to the caller.

Signed-off-by: Sage Weil <sage@inktank.com>


# 3b5ede07 20-Jul-2012 Sage Weil <sage@inktank.com>

libceph: fix fault locking; close socket on lossy fault

If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.

We should also take the con mutex before printing out the state bits in
the debug output.

Signed-off-by: Sage Weil <sage@inktank.com>


# 85effe18 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: reset connection retry on successfully negotiation

We exponentially back off when we encounter connection errors. If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.

Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node. Fixes ceph issue #2802.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 5469155f 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: protect ceph_con_open() with mutex

Take the con mutex while we are initiating a ceph open. This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# a4107026 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: (re)initialize bio_iter on start of message receive

Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path. The problem
is that a sequence like:

- start reading message
- initialize bio_iter
- read half a message
- messenger fault, reconnect
- restart reading message
- ** bio_iter now non-NULL, not reinitialized **
- read past end of bio, crash

Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>


# 8c50c817 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: fix mutex coverage for ceph_con_close

Hold the mutex while twiddling all of the state bits to avoid possible
races. While we're here, make not of why we cannot close the socket
directly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>


# 3a140a0d 30-Jul-2012 Sage Weil <sage@inktank.com>

libceph: report socket read/write error message

We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}(). This is more informative than

libceph: osd0 192.168.106.220:6801 (null)

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>


# a2a32584 08-Jul-2012 Guanjun He <gjhe@suse.com>

libceph: prevent the race of incoming work during teardown

Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.

Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# a16cb1f7 10-Jul-2012 Sage Weil <sage@inktank.com>

libceph: fix messenger retry

In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>


# 5bdca4e0 10-Jul-2012 Sage Weil <sage@inktank.com>

libceph: fix messenger retry

In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>


# fbb85a47 27-Jun-2012 Sage Weil <sage@inktank.com>

libceph: allow sock transition from CONNECTING to CLOSED

It is possible to close a socket that is in the OPENING state. For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established. con_work() will come around and shut
down the socket.

Signed-off-by: Sage Weil <sage@inktank.com>


# b7a9e5dd 27-Jun-2012 Sage Weil <sage@inktank.com>

libceph: set peer name on con_open, not init

The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>


# bc18f4b1 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: add some fine ASCII art

Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 5821bd8c 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: small changes to messenger.c

This patch gathers a few small changes in "net/ceph/messenger.c":
out_msg_pos_next()
- small logic change that mostly affects indentation
write_partial_msg_pages().
- use a local variable trail_off to represent the offset into
a message of the trail portion of the data (if present)
- once we are in the trail portion we will always be there, so we
don't always need to check against our data position
- avoid computing len twice after we've reached the trail
- get rid of the variable tmpcrc, which is not needed
- trail_off and trail_len never change so mark them const
- update some comments
read_partial_message_bio()
- bio_iovec_idx() will never return an error, so don't bother
checking for it

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 7593af92 24-May-2012 Alex Elder <elder@inktank.com>

libceph: distinguish two phases of connect sequence

Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer. Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun. The
CONNECTING bit/state continues to be set during this phase.

Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states. In other
words, when NEGOTIATING is set, clear CONNECTING. That way only
one of them will be active at a time.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# ab166d5a 31-May-2012 Alex Elder <elder@inktank.com>

libceph: separate banner and connect writes

There are two phases in the process of linking together the two ends
of a ceph connection. The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.

When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.

So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).

Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed. Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.

Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# e27947c7 23-May-2012 Alex Elder <elder@inktank.com>

libceph: define and use an explicit CONNECTED state

There is no state explicitly defined when a ceph connection is fully
operational. So define one.

It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.

Be a little more careful when examining the old state when a socket
disconnect event is reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 3ec50d18 23-May-2012 Alex Elder <elder@inktank.com>

libceph: clear NEGOTIATING when done

A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server). But that bit
is not cleared again--at least not until another connection attempt
is initiated.

Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# bb9e6bba 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: clear CONNECTING in ceph_con_close()

A connection that is closed will no longer be connecting. So
clear the CONNECTING state bit in ceph_con_close(). Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 456ea468 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: don't touch con state in con_close_socket()

In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.

Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down. We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.

A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down. Add a comment explaining
the reason for this.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# d65c9e0b 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: just set SOCK_CLOSED when state changes

When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.

con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set. This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.

What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.

The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.

So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 188048bc 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: don't change socket state on sock event

Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.

Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).

Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# a8d00e3c 20-Jun-2012 Alex Elder <elder@inktank.com>

libceph: SOCK_CLOSED is a flag, not a state

The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.

libceph: start separating connection flags from state
commit 928443cd

That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.

Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been. This fixes that
error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# abdaa6a8 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: don't use bio_iter as a flag

Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
commit 43643528cce60ca184fe8197efa8e8da7c89a037
Author: Yan, Zheng <zheng.z.yan@intel.com>
rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()). This means we no
longer need to use the msg->bio_iter field as a flag.

So just don't do that any more. Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 572c588e 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: move init of bio_iter

If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already. This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# df6ad1f9 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: move init_bio_*() functions up

Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# fd154f3c 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: don't mark footer complete before it is

This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed. Hold off setting the complete flag
until we've decided it's ready to send.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 84ca8fc8 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: encapsulate advancing msg page

In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one. Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 739c905b 11-Jun-2012 Alex Elder <elder@inktank.com>

libceph: encapsulate out message data setup

Move the code that prepares to write the data portion of a message
into its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# d59315ca 21-Jun-2012 Sage Weil <sage@inktank.com>

libceph: drop ceph_con_get/put helpers and nref member

These are no longer used. Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.

Signed-off-by: Sage Weil <sage@inktank.com>


# 36eb71aa 21-Jun-2012 Sage Weil <sage@inktank.com>

libceph: use con get/put methods

The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# b132cf4c 06-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com>

rbd: Clear ceph_msg->bio_iter for retransmitted message

The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)


# 26ce1719 19-Jun-2012 Dan Carpenter <dan.carpenter@oracle.com>

libceph: fix NULL dereference in reset_connection()

We dereference "con->in_msg" on the line after it was set to NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 89a86be0 09-Jun-2012 Sage Weil <sage@inktank.com>

libceph: transition socket state prior to actual connect

Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED. Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.

This was introduced in 928443cd9644e7cfd46f687dbeffda2d1a357ff9,
with the original socket state code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 43643528 06-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com>

rbd: Clear ceph_msg->bio_iter for retransmitted message

The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 8921d114 01-Jun-2012 Alex Elder <elder@inktank.com>

libceph: make ceph_con_revoke_message() a msg op

ceph_con_revoke_message() is passed both a message and a ceph
connection. A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.

Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer. The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field. That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.

Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 6740a845 01-Jun-2012 Alex Elder <elder@inktank.com>

libceph: make ceph_con_revoke() a msg operation

ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.

This has the added benefit of precluding the possibility of the
providing the wrong connection pointer. If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op. This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."

Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 92ce034b 04-Jun-2012 Alex Elder <elder@inktank.com>

libceph: have messages take a connection reference

There are essentially two types of ceph messages: incoming and
outgoing. Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection. Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.

When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent. The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.

This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).


For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.

Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send(). Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection. At that
time the message is inserted into a connection's out_queue list.

When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list. At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred. When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.

So basically, any time a message is on one of a connection's lists,
it is associated with that connection. Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.

Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc(). At that
point the message is associated with the connection. Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.

Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.

We should never fail to get a connection reference for a
message--the since the caller should already hold one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 38941f80 01-Jun-2012 Alex Elder <elder@inktank.com>

libceph: have messages point to their connection

When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue). When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent). When acknowledgement for the
message is received, it is removed from the sent messages list.

During that entire time the message is "in the possession" of a
single ceph connection. Keep track of that connection in the
message. This will be used in the next patch (and is a helpful
bit of information for debugging anyway).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 1c20f2d2 04-Jun-2012 Alex Elder <elder@inktank.com>

libceph: tweak ceph_alloc_msg()

The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer. Rename the
function so this implied usage is more clear.

In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for). This allows us
to return what is now provided via the passed-in address of a "skip"
variable. The return type is now Boolean to be explicit that there
are only two possible outcomes.

Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 1bfd89f4 26-May-2012 Alex Elder <elder@inktank.com>

libceph: fully initialize connection in con_init()

Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init(). Rearrange the arguments so the connection pointer
is first. Hide the byte-swapping of the peer entity number inside
ceph_con_init()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# a5988c49 29-May-2012 Alex Elder <elder@inktank.com>

libceph: set CLOSED state bit in con_init

Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bits if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# ce2c8903 22-May-2012 Alex Elder <elder@inktank.com>

libceph: start tracking connection socket state

Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself. Create
placeholder functions to encapsulate the state transitions.

--------
| NEW* | transient initial state
--------
| con_sock_state_init()
v
----------
| CLOSED | initialized, but no socket (and no
---------- TCP connection)
^ \
| \ con_sock_state_connecting()
| ----------------------
| \
+ con_sock_state_closed() \
|\ \
| \ \
| ----------- \
| | CLOSING | socket event; \
| ----------- await close \
| ^ |
| | |
| + con_sock_state_closing() |
| / \ |
| / --------------- |
| / \ v
| / --------------
| / -----------------| CONNECTING | socket created, TCP
| | / -------------- connect initiated
| | | con_sock_state_connected()
| | v
-------------
| CONNECTED | TCP connection established
-------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transtion with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held). We can back out the atomicity later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>


# 928443cd 22-May-2012 Alex Elder <elder@inktank.com>

libceph: start separating connection flags from state

A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field. To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>


# 15d9882c 26-May-2012 Alex Elder <elder@inktank.com>

libceph: embed ceph messenger structure in ceph_client

A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# e2200423 23-May-2012 Alex Elder <elder@inktank.com>

libceph: rename kvec_reset and kvec_add functions

The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 327800bd 22-May-2012 Alex Elder <elder@inktank.com>

libceph: rename socket callbacks

Change the names of the three socket callback functions to make it
more obvious they're specifically associated with a connection's
socket (not the ceph connection that uses it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 6384bb8b 29-May-2012 Alex Elder <elder@inktank.com>

libceph: kill bad_proto ceph connection op

No code sets a bad_proto method in its ceph connection operations
vector, so just get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>


# e5e372da 22-May-2012 Alex Elder <elder@inktank.com>

libceph: eliminate connection state "DEAD"

The ceph connection state "DEAD" is never set and is therefore not
needed. Eliminate it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>


# 3da54776 16-May-2012 Alex Elder <elder@inktank.com>

ceph: add auth buf in prepare_write_connect()

Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller. This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.

Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized. An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.

Fixes http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# dac1e716 16-May-2012 Alex Elder <elder@inktank.com>

ceph: rename prepare_connect_authorizer()

Change the name of prepare_connect_authorizer(). The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.

In addition, pass the address of a variable that will hold the
authorization protocol to use. Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 729796be 16-May-2012 Alex Elder <elder@inktank.com>

ceph: return pointer from prepare_connect_authorizer()

Change prepare_connect_authorizer() so it returns a pointer (or
pointer-coded error).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 8f43fb53 16-May-2012 Alex Elder <elder@inktank.com>

ceph: use info returned by get_authorizer

Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# a3530df3 16-May-2012 Alex Elder <elder@inktank.com>

ceph: have get_authorizer methods return pointers

Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value. This is to pave the way for making use of the
returned value in an upcoming patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# ed96af64 16-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: check return from get_authorizer

In prepare_connect_authorizer(), a connection's get_authorizer
method is called but ignores its return value. This function can
return an error, so check for it and return it if that ever occurs.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# b1c6b980 16-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: rework prepare_connect_authorizer()

Change prepare_connect_authorizer() so it returns without dropping
the connection mutex if the connection has no get_authorizer method.

Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning
authorization protocols.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 5a0f8fdd 16-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: check prepare_write_connect() result

prepare_write_connect() can return an error, but only one of its
callers checks for it. All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# e10c758e 16-May-2012 Alex Elder <elder@inktank.com>

ceph: don't set WRITE_PENDING too early

prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection. Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending. It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING big getting set.

Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.

Partially addresses http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# e825a66d 16-May-2012 Alex Elder <elder@inktank.com>

ceph: drop msgr argument from prepare_write_connect()

In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr. Just get the msgr
value from the ceph connection and drop the unneeded argument.

The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 41b90c00 16-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: send banner in process_connect()

prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message. It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 84fb3adf 16-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: reset connection kvec caller

Reset a connection's kvec fields in the caller rather than in
prepare_write_connect(). This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# d329156f 16-May-2012 Alex Elder <elder@inktank.com>

libceph: don't reset kvec in prepare_write_banner()

Move the kvec reset for a connection out of prepare_write_banner and
into its only caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# fd51653f 10-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: change read_partial() to take "end" arg

Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents. This
amounts to moving the addition "to + size" into the caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# e6cee71f 10-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: update "to" in read_partial() caller

read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.

Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 57dac9d1 10-May-2012 Alex Elder <elder@inktank.com>

ceph: messenger: use read_partial() in read_partial_message()

There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial(). Do that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 95c96174 14-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com>

net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8d63e318 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: isolate kmap() call in write_partial_msg_pages()

In write_partial_msg_pages(), every case now does an identical call
to kmap(page). Instead, just call it once inside the CRC-computing
block where it's needed. Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.

We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# 9bd19663 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: rename "page_shift" variable to something sensible

In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use. Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).

Since it's only used in the bio case rename it "bio_offset". Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function. This makes the bio case match the
others more closely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# 0cdf9e60 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: get rid of zero_page_address

There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module. Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again. It's almost certain to
be in low memory anyway.

So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# e36b13cc 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: only call kernel_sendpage() via helper

Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
used, by using this helper in write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# 31739139 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: use kernel_sendpage() for sending zeroes

If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data. This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().

Since we are already working with a page in this case we can use
the sendpage interface instead. Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# 37675b0f 07-Mar-2012 Alex Elder <elder@dreamhost.com>

libceph: fix inverted crc option logic

CRC's are computed for all messages between ceph entities. The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option. The default is
for CRC computation for the data portion to be enabled.

Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.

Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.

This fixes http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>


# 84495f49 15-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: some simple changes

Nothing too big here.
- define the size of the buffer used for consuming ignored
incoming data using a symbolic constant
- simplify the condition determining whether to unmap the page
in write_partial_msg_pages(): do it for crc but not if the
page is the zero page

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# f42299e6 15-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: small refactor in write_partial_kvec()

Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call. Same functionality, just blocked out
a little differently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# fe3ad593 15-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: do crc calculations outside loop

Move blocks of code out of loops in read_partial_message_section()
and read_partial_message(). They were only was getting called at
the end of the last iteration of the loop anyway.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# a9a0c51a 15-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: separate CRC calculation from byte swapping

Calculate CRC in a separate step from rearranging the byte order
of the result, to improve clarity and readability.

Use offsetof() to determine the number of bytes to include in the
CRC calculation.

In read_partial_message(), switch which value gets byte-swapped,
since the just-computed CRC is already likely to be in a register.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# bca064d2 15-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: use "do" in CRC-related Boolean variables

Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.

Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages(): the value of "do_crc"
assigned appears to be the opposite of what it should be. No
attempt to fix this is made here; this change preserves the
erroneous behavior. The problem I found is documented here:
http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# d3002b97 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: a few small changes

This gathers a number of very minor changes:
- use %hu when formatting the a socket address's address family
- null out the ceph_msgr_wq pointer after the queue has been
destroyed
- drop a needless cast in ceph_write_space()
- add a WARN() call in ceph_state_change() in the event an
unrecognized socket state is encountered
- rearrange the logic in ceph_con_get() and ceph_con_put() so
that:
- the reference counts are only atomically read once
- the values displayed via dout() calls are known to
be meaningful at the time they are formatted

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 41617d0c 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: make ceph_tcp_connect() return int

There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller. Instead, have it return an error code,
which tidies things up a bit.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 6173d1f0 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: encapsulate some messenger cleanup code

Define a helper function to perform various cleanup operations. Use
it both in the exit routine and in the init routine in the event of
an error.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# e0f43c94 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: make ceph_msgr_wq private

The messenger workqueue has no need to be public. So give it static
scope.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 859eb799 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: encapsulate connection kvec operations

Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array. Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.

Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 963be4d7 14-Feb-2012 Alex Elder <elder@dreamhost.com>

libceph: move prepare_write_banner()

One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.

This was split out from the next patch to highlight this change in
logic.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 99f0f3b2 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: eliminate some abusive casts

This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism. Instead, properly cast the
type to the intended target type.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# bd406145 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: eliminate some needless casts

This eliminates type casts in some places where they are not
required.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# f64a9317 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: kill addr_str_lock spinlock; use atomic instead

A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption. The index is reset to 0 if it ever reaches the maximum
index value.

Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits. Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).

The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
[1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.

Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# a5bc3129 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: make use of "else" where appropriate

Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements. Don't record a
connection's socket pointer unless the connect operation is
successful.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 57666519 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: use a shared zero page rather than one per messenger

Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition. Instead,
use the kernel ZERO_PAGE() for that purpose.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 182fac26 29-Feb-2012 Jim Schutt <jaschut@sandia.gov>

net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer

The Ceph messenger would sometimes queue multiple work items to write
data to a socket when the socket buffer was full.

Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
clearing it only when sufficient space is available in the socket buffer.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@dreamhost.com>


# bc3b2d7f 15-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules

These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# ee3b56f2 23-Sep-2011 Noah Watkins <noahwatkins@gmail.com>

ceph: use kernel DNS resolver

Change ceph_parse_ips to take either names given as
IP addresses or standard hostnames (e.g. localhost).
The DNS lookup is done using the dns_resolver facility
similar to its use in AFS, NFS, and CIFS.

This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER
that controls if this feature is on or off.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# f0ed1b7c 09-Aug-2011 Sage Weil <sage@newdream.net>

libceph: warn on msg allocation failures

Any non-masked msg allocation failure should generate a warning and stack
trace to the console. All of these need to eventually be replaced by
safe preallocation or msgpools.

Signed-off-by: Sage Weil <sage@newdream.net>


# b61c2763 09-Aug-2011 Sage Weil <sage@newdream.net>

libceph: don't complain on msgpool alloc failures

The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>


# c0d5f9db 16-Sep-2011 Jim Schutt <jaschut@sandia.gov>

libceph: initialize ack_stamp to avoid unnecessary connection reset

Commit 4cf9d544631c recorded when an outgoing ceph message was ACKed,
in order to avoid unnecessary connection resets when an OSD is busy.

However, ack_stamp is uninitialized, so there is a window between
when the message is sent and when it is ACKed in which handle_timeout()
interprets the unitialized value as an expired timeout, and resets
the connection unnecessarily.

Close the window by initializing ack_stamp.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>


# 4cf9d544 26-Jul-2011 Sage Weil <sage@newdream.net>

libceph: don't time out osd requests that haven't been received

Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing). Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# a2a79609 12-May-2011 Sage Weil <sage@newdream.net>

libceph: add missing breaks in addr_set_port

Signed-off-by: Sage Weil <sage@newdream.net>


# 04177882 12-May-2011 Sage Weil <sage@newdream.net>

libceph: fix TAG_WAIT case

If we get a WAIT as a client something went wrong; error out. And don't
fall through to an unrelated case.

Signed-off-by: Sage Weil <sage@newdream.net>


# 12a2f643 12-May-2011 Sage Weil <sage@newdream.net>

libceph: use snprintf for unknown addrs

Signed-off-by: Sage Weil <sage@newdream.net>


# e8f54ce1 12-May-2011 Sage Weil <sage@newdream.net>

libceph: fix uninitialized value when no get_authorizer method is set

If there is no get_authorizer method we set the out_kvec to a bogus
pointer. The length is also zero in that case, so it doesn't much matter,
but it's better not to add the empty item in the first place.

Signed-off-by: Sage Weil <sage@newdream.net>


# 0da5d703 19-May-2011 Sage Weil <sage@newdream.net>

libceph: handle connection reopen race with callbacks

If a connection is closed and/or reopened (ceph_con_close, ceph_con_open)
it can race with a callback. con_work does various state checks for
closed or reopened sockets at the beginning, but drops con->mutex before
making callbacks. We need to check for state bit changes after retaking
the lock to ensure we restart con_work and execute those CLOSED/OPENING
tests or else we may end up operating under stale assumptions.

In Jim's case, this was causing 'bad tag' errors.

There are four cases where we re-take the con->mutex inside con_work: catch
them all and return EAGAIN from try_{read,write} so that we can restart
con_work.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>


# ca20892d 02-May-2011 Henry C Chang <henry.cy.chang@gmail.com>

libceph: fix ceph_msg_new error path

If memory allocation failed, calling ceph_msg_put() will cause GPF
since some of ceph_msg variables are not initialized first.

Fix Bug #970.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# e00de341 04-Mar-2011 Sage Weil <sage@newdream.net>

libceph: fix msgr standby handling

The standby logic used to be pretty dependent on the work requeueing
behavior that changed when we switched to WQ_NON_REENTRANT. It was also
very fragile.

Restructure things so that:
- We clear WRITE_PENDING when we set STANDBY. This ensures we will
requeue work when we wake up later.
- con_work backs off if STANDBY is set. There is nothing to do if we are
in standby.
- clear_standby() helper is called by both con_send() and con_keepalive(),
the two actions that can wake us up again. Move the connect_seq++
logic here.

Signed-off-by: Sage Weil <sage@newdream.net>


# e76661d0 03-Mar-2011 Sage Weil <sage@newdream.net>

libceph: fix msgr keepalive flag

There was some broken keepalive code using a dead variable. Shift to using
the proper bit flag.

Signed-off-by: Sage Weil <sage@newdream.net>


# 60bf8bf8 04-Mar-2011 Sage Weil <sage@newdream.net>

libceph: fix msgr backoff

With commit f363e45f we replaced a bunch of hacky workqueue mutual
exclusion logic with the WQ_NON_REENTRANT flag. One pieces of fallout is
that the exponential backoff breaks in certain cases:

* con_work attempts to connect.
* we get an immediate failure, and the socket state change handler queues
immediate work.
* con_work calls con_fault, we decide to back off, but can't queue delayed
work.

In this case, we add a BACKOFF bit to make con_work reschedule delayed work
next time it runs (which should be immediately).

Signed-off-by: Sage Weil <sage@newdream.net>


# 692d20f5 03-Mar-2011 Sage Weil <sage@newdream.net>

libceph: retry after authorization failure

If we mark the connection CLOSED we will give up trying to reconnect to
this server instance. That is appropriate for things like a protocol
version mismatch that won't change until the server is restarted, at which
point we'll get a new addr and reconnect. An authorization failure like
this is probably due to the server not properly rotating it's secret keys,
however, and should be treated as transient so that the normal backoff and
retry behavior kicks in.

Signed-off-by: Sage Weil <sage@newdream.net>


# 42961d23 25-Jan-2011 Sage Weil <sage@newdream.net>

libceph: fix socket write error handling

Pass errors from writing to the socket up the stack. If we get -EAGAIN,
return 0 from the helper to simplify the callers' checks.

Signed-off-by: Sage Weil <sage@newdream.net>


# 98bdb0aa 25-Jan-2011 Sage Weil <sage@newdream.net>

libceph: fix socket read error handling

If we get EAGAIN when trying to read from the socket, it is not an error.
Return 0 from the helper in this case to simplify the error handling cases
in the caller (indirectly, try_read).

Fix try_read to pass any error to it's caller (con_work) instead of almost
always returning 0. This let's us respond to things like socket
disconnects.

Signed-off-by: Sage Weil <sage@newdream.net>


# f363e45fd1 03-Jan-2011 Tejun Heo <tj@kernel.org>

net/ceph: make ceph_msgr_wq non-reentrant

ceph messenger code does a rather complex dancing around multithread
workqueue to make sure the same work item isn't executed concurrently
on different CPUs. This restriction can be provided by workqueue with
WQ_NON_REENTRANT.

Make ceph_msgr_wq non-reentrant workqueue with the default concurrency
level and remove the QUEUED/BUSY logic.

* This removes backoff handling in con_work() but it couldn't reliably
block execution of con_work() to begin with - queue_con() can be
called after the work started but before BUSY is set. It seems that
it was an optimization for a rather cold path and can be safely
removed.

* The number of concurrent work items is bound by the number of
connections and connetions are independent from each other. With
the default concurrency level, different connections will be
executed independently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>


# d96c9043 13-Dec-2010 Sage Weil <sage@newdream.net>

ceph: fix msgr_init error path

create_workqueue() returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>


# c5c6b19d 09-Nov-2010 Sage Weil <sage@newdream.net>

ceph: explicitly specify page alignment in network messages

The alignment used for reading data into or out of pages used to be taken
from the data_off field in the message header. This only worked as long
as the page alignment matched the object offset, breaking direct io to
non-page aligned offsets.

Instead, explicitly specify the page alignment next to the page vector
in the ceph_msg struct, and use that instead of the message header (which
probably shouldn't be trusted). The alloc_msg callback is responsible for
filling in this field properly when it sets up the page vector.

Signed-off-by: Sage Weil <sage@newdream.net>


# df9f86fa 01-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix small seq message skipping

If the client gets out of sync with the server message sequence number, we
normally skip low seq messages (ones we already received). The skip code
was also incrementing the expected seq, such that all subsequent messages
also appeared old and got skipped, and an eventual timeout on the osd
connection. This resulted in some lagging requests and console messages
like

[233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
[233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
[233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
[233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
[233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
[233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
[233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
[233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
[233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
[233500.534614] ceph: tid 6023602 timed out on osd22, will reset osd

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sage Weil <sage@newdream.net>


# 3d14c5d2 06-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>