History log of /linux-master/include/rdma/rdmavt_qp.h
Revision Date Author Comments
# 44da3730 23-Jul-2021 Leon Romanovsky <leon@kernel.org>

RDMA/rdmavt: Decouple QP and SGE lists allocations

The rdmavt QP has fields that are both needed for the control and data
path. Such mixed declaration caused to the very specific allocation flow
with kzalloc_node and SGE list embedded into the struct rvt_qp.

This patch separates QP creation to two: regular memory allocation for the
control path and specific code for the SGE list, while the access to the
later is performed through derefenced pointer.

Such pointer and its context are expected to be in the cache, so
performance difference is expected to be negligible, if any exists.

Link: https://lore.kernel.org/r/f66c1e20ccefba0db3c69c58ca9c897f062b4d1c.1627040189.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 54a485e9 28-Jul-2020 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Fix RQ counting issues causing use of an invalid RWQE

The lookaside count is improperly initialized to the size of the
Receive Queue with the additional +1. In the traces below, the
RQ size is 384, so the count was set to 385.

The lookaside count is then rarely refreshed. Note the high and
incorrect count in the trace below:

rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9008 wr_id 55c7206d75a0 qpn c
qpt 2 pid 3018 num_sge 1 head 1 tail 0, count 385
rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1

The head,tail indicate there is only one RWQE posted although the count
says 385 and we correctly return the element 0.

The next call to rvt_get_rwqe with the decremented count:

rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9058 wr_id 0 qpn c
qpt 2 pid 3018 num_sge 0 head 1 tail 1, count 384
rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1

Note that the RQ is empty (head == tail) yet we return the RWQE at tail 1,
which is not valid because of the bogus high count.

Best case, the RWQE has never been posted and the rc logic sees an RWQE
that is too small (all zeros) and puts the QP into an error state.

In the worst case, a server slow at posting receive buffers might fool
rvt_get_rwqe() into fetching an old RWQE and corrupt memory.

Fix by deleting the faulty initialization code and creating an
inline to fetch the posted count and convert all callers to use
new inline.

Fixes: f592ae3c999f ("IB/rdmavt: Fracture single lock used for posting and processing RWQEs")
Link: https://lore.kernel.org/r/20200728183848.22226.29132.stgit@awfm-01.aw.intel.com
Reported-by: Zhaojuan Guo <zguo@redhat.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Honggang Li <honli@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 6bf9d8f6 19-Jul-2020 Leon Romanovsky <leon@kernel.org>

RDMA/include: Replace license text with SPDX tags

The header files in RDMA subsystem are dual licensed and can be
described by simple SPDX tag, so replace all of them at once
together with making them use the same coding style for header
guard defines.

Link: https://lore.kernel.org/r/20200719072521.135260-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 84e3b19a 10-May-2020 Gary Leshner <Gary.S.Leshner@intel.com>

IB/hfi1: Remove module parameter for KDETH qpns

The module parameter for KDETH qpns is being removed in favor
of always using the default value of 0x80 as the qpn prefix.
Defines have been added for various KDETH values including
the prefix of 0x80.
The reserved range now starts at the base value for KDETH
qpns (0x80) and extends up to and including the last qpn for
other reserved QP prefixed types.
Adjust other QP prefixed define names to match KDETH defined
names.

Link: https://lore.kernel.org/r/20200511160600.173205.27508.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 0cb9e4f9 07-May-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

IB/rdmavt: Replace zero-length array with flexible-array

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: https://lore.kernel.org/r/20200507185342.GA14476@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 5b361328 12-Feb-2020 Gustavo A. R. Silva <gustavo@embeddedor.com>

RDMA: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: https://lore.kernel.org/r/20200213010425.GA13068@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> # added a few more


# 4ad6429d 19-Dec-2019 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Correct comments in rdmavt_qp.h header

Comments need to be with the definition of rvt_restart_sge().

Other comments were duplicated in sw/rdmavt/rc.c and were removed.

Fixes: 385156c5f2a6 ("IB/hfi: Move RC functions into a header file")
Link: https://lore.kernel.org/r/20191219211934.58387.88014.stgit@awfm-01.aw.intel.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 71994354 11-Sep-2019 Kaike Wan <kaike.wan@intel.com>

IB/{rdmavt, hfi1, qib}: Add a counter for credit waits

This patch adds a counter for credit waits to assist field debugging.

Link: https://lore.kernel.org/r/20190911113047.126040.10857.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 2b74c878 14-Jul-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi1: Unreserve a flushed OPFN request

When an OPFN request is flushed, the request is completed without
unreserving itself from the send queue. Subsequently, when a new
request is post sent, the following warning will be triggered:

WARNING: CPU: 4 PID: 8130 at rdmavt/qp.c:1761 rvt_post_send+0x72a/0x880 [rdmavt]
Call Trace:
[<ffffffffbbb61e41>] dump_stack+0x19/0x1b
[<ffffffffbb497688>] __warn+0xd8/0x100
[<ffffffffbb4977cd>] warn_slowpath_null+0x1d/0x20
[<ffffffffc01c941a>] rvt_post_send+0x72a/0x880 [rdmavt]
[<ffffffffbb4dcabe>] ? account_entity_dequeue+0xae/0xd0
[<ffffffffbb61d645>] ? __kmalloc+0x55/0x230
[<ffffffffc04e1a4c>] ib_uverbs_post_send+0x37c/0x5d0 [ib_uverbs]
[<ffffffffc04e5e36>] ? rdma_lookup_put_uobject+0x26/0x60 [ib_uverbs]
[<ffffffffc04dbce6>] ib_uverbs_write+0x286/0x460 [ib_uverbs]
[<ffffffffbb6f9457>] ? security_file_permission+0x27/0xa0
[<ffffffffbb641650>] vfs_write+0xc0/0x1f0
[<ffffffffbb64246f>] SyS_write+0x7f/0xf0
[<ffffffffbbb74ddb>] system_call_fastpath+0x22/0x27

This patch fixes the problem by moving rvt_qp_wqe_unreserve() into
rvt_qp_complete_swqe() to simplify the code and make it less
error-prone.

Fixes: ca95f802ef51 ("IB/hfi1: Unreserve a reserved request when it is completed")
Link: https://lore.kernel.org/r/20190715164528.74174.31364.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 2b0ad2da 28-Jun-2019 Michael J. Ruhl <michael.j.ruhl@intel.com>

IB/{rdmavt, hfi1, qib}: Add helpers to hide SWQE WR details

Add some helper functions to hide struct rvt_swqe details.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# d310c4bf 28-Jun-2019 Michael J. Ruhl <michael.j.ruhl@intel.com>

IB/{rdmavt, hfi1, qib}: Remove AH refcount for UD QPs

Historically rdmavt destroy_ah() has returned an -EBUSY when the AH has a
non-zero reference count. IBTA 11.2.2 notes no such return value or error
case:

Output Modifiers:
- Verb results:
- Operation completed successfully.
- Invalid HCA handle.
- Invalid address handle.

ULPs never test for this error and this will leak memory.

The reference count exists to allow for driver independent progress
mechanisms to process UD SWQEs in parallel with post sends. The SWQE will
hold a reference count until the UD SWQE completes and then drops the
reference.

Fix by removing need to reference count the AH. Add a UD specific
allocation to each SWQE entry to cache the necessary information for
independent progress. Copy the information during the post send
processing.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 5136bfea 28-Jun-2019 Kamenee Arumugam <kamenee.arumugam@intel.com>

IB/{hfi1, qib, rdmavt}: Put qp in error state when cq is full

When a completion queue is full, the associated queue pairs are not put
into the error state. According to the IBTA specification, this is a
violation.

Quote from IBTA spec:
C9-218: A Requester Class F error occurs when the CQ is inaccessible or
full and an attempt is made to complete a WQE. The Affected QP shall be
moved to the error state and affiliated asynchronous errors generated as
described in 11.6.3.1 Affiliated Asynchronous Events on page 678. The
current WQE and any subsequent WQEs are left in an unknown state.

C11-37: The CI shall generate a CQ Error when a CQ overrun is
detected. This condition will result in an Affiliated Asynchronous Error
for any associated Work Queues when they attempt to use that
CQ. Completions can no longer be added to the CQ. It is not guaranteed
that completions present in the CQ at the time the error occurred can be
retrieved. Possible causes include a CQ overrun or a CQ protection error.

Put the qp in error state when cq is full. Implement a state called full
to continue to put other associated QPs in error state.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# f592ae3c 28-Jun-2019 Kamenee Arumugam <kamenee.arumugam@intel.com>

IB/rdmavt: Fracture single lock used for posting and processing RWQEs

Usage of single lock prevents fetching posted and processing receive work
queue entries from progressing simultaneously and impacts overall
performance.

Fracture the single lock used for posting and processing Receive Work
Queue Entries (RWQEs) to allow the circular buffer to be filled and
emptied at the same time. Two new spinlocks - one for the producers and
one for the consumers used for posting and processing RWQEs simultaneously
and the two indices are define on two different cache lines. The threshold
count is used to avoid reading other index in different cache line every
time.

Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# dabac6e4 28-Jun-2019 Kamenee Arumugam <kamenee.arumugam@intel.com>

IB/hfi1: Move receive work queue struct into uapi directory

The rvt_rwqe and rvt_rwq struct elements are shared between rdmavt and the
providers but are not in uapi directory. As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2, The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.

Move rvt_rwqe and rvt_rwq struct into rvt-abi.h header in uapi directory.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 239b0e52 28-Jun-2019 Kamenee Arumugam <kamenee.arumugam@intel.com>

IB/hfi1: Move rvt_cq_wc struct into uapi directory

The rvt_cq_wc struct elements are shared between rdmavt and the providers
but not in uapi directory. As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2 The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.

In that case, move rvt_cq_wc struct into the rvt-abi.h header file and
create a rvt_k_cq_w for the kernel completion queue.

Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 4a9ceb7d 13-Jun-2019 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/{rdmavt, qib, hfi1}: Convert to new completion API

Convert all completions to use the new completion routine that
fixes a race between post send and completion where fields from
a SWQE can be read after SWQE has been freed.

This patch also addresses issues reported in
https://marc.info/?l=linux-kernel&m=155656897409107&w=2.

The reserved operation path has no need for any barrier.

The barrier for the other path is addressed by the
smp_load_acquire() barrier.

Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# f56044d6 13-Jun-2019 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add new completion inline

There is opencoded send completion logic all over all
the drivers.

We need to convert to this routine to enforce ordering
issues for completions. This routine fixes an ordering
issue where the read of the SWQE fields necessary for creating
the completion can race with a post send if the post send catches
a send queue at the edge of being full. Is is possible in that situation
to read SWQE fields that are being written.

This new routine insures that SWQE fields are read prior to advancing
the index that post send uses to determine queue fullness.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d40f69c9 12-Apr-2019 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/{rdmavt, qib, hfi1}: Use new routine to release reference counts

The reference count adjustments on reference count completion
are open coded throughout.

Add a routine to do all reference count adjustments and use.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 715ab1a8 11-Apr-2019 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Fix ab/ba include issues

The currently include file ordering for rdmavt headers has an
ab/ba include issue the precludes using inlines from rdma_vt.h
in rdmavt_qp.h.

At the heart of the issue is that rdma_vt.h includes rdmavt_qp.h.

Fix the ordering issue by adjusting rdma_vt.h to not require rdmavt_qp.h
and move qp related inlines to rdmavt_qp.h.

Additionally, promote rvt_mmap_info to rdma_vt.h since it is shared
by rdmavt_cq.h and rdmavt_qp.h.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# ea752bc5 11-Apr-2019 Kaike Wan <kaike.wan@intel.com>

IB/{rdmavt, hfi1): Miscellaneous comment fixes

This patch fixes miscellaneous comment errors.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 3c6cb20a 23-Jan-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs

This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 4f9264d1 23-Jan-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi1: Add an s_acked_ack_queue pointer

The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.

In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.

The detection of these two conditions are imported in determining when an
old entry can safely be overwritten with a new received request and the
resources associated with the old request be safely released.

When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.

What this means is that entries in the ring, which are "behind"
s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
longer considered complete. This is where the problem is - a newly
received request could potentially overwrite a still active TID RDMA WRITE
request.

The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.

This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 039cd3da 23-Jan-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi1: Increment the retry timeout value for TID RDMA READ request

The RC retry timeout value is based on the estimated time for the
response packet to come back. However, for TID RDMA READ request, due
to the use of header suppression, the driver is normally not notified
for each incoming response packet until the last TID RDMA READ response
packet. Consequently, the retry timeout value should be extended to
cover the transaction time for the entire length of a segment (default
256K) instead of that for a single packet. This patch addresses the
issue by introducing new retry timer functions to account for multiple
packets and wrapper functions for backward compatibility.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 838b6fd2 23-Jan-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi1: TID RDMA RcvArray programming and TID allocation

TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request, software
flows are also allocated and freed as needed. Since software flows
consume large amount of memory for tracking TID allocation and freeing,
it is generally desirable to allocate them dynamically in the send queue
and only for TID RDMA requests, but pre-allocate them for receive queue
because the send queue could have thousands of entries while the receive
queue has only a limited number of entries.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 385156c5 23-Jan-2019 Kaike Wan <kaike.wan@intel.com>

IB/hfi: Move RC functions into a header file

This patch moves some RC helper functions into a header file so that
they can be called from both RC and TID RDMA functions. In addition,
a common function for rewinding a request is created in rdmavt so that
it can be shared between qib and hfi1 driver.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 15703461 26-Sep-2018 Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>

IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt

This patch moves ruc_loopback() from hfi1 into rdmavt for code sharing
with the qib driver.

Reviewed-by: Brian Welty <brian.welty@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 116aa033 26-Sep-2018 Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>

IB/{hfi1, qib, rdmavt}: Move send completion logic to rdmavt

Moving send completion code into rdmavt in order to have shared logic
between qib and hfi1 drivers.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 019f118b 26-Sep-2018 Brian Welty <brian.welty@intel.com>

IB/{hfi1, qib, rdmavt}: Move copy SGE logic into rdmavt

This patch moves hfi1_copy_sge() into rdmavt for sharing with qib.
This patch also moves all the wss_*() functions into rdmavt as
several wss_*() functions are called from hfi1_copy_sge()

When SGE copy mode is adaptive, cacheless copy may be done in some cases
for performance reasons. In those cases, X86 cacheless copy function
is called since the drivers that use rdmavt and may set SGE copy mode
to adaptive are X86 only. For this reason, this patch adds
"depends on X86_64" to rdmavt/Kconfig.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 2e2ba09e 04-Jun-2018 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt, IB/hfi1: Create device dependent s_flags

Move some s_flags defines out of rdmavt and into hfi1 because they are
hfi1 specific and therefore should remain in the driver instead of
bubbling up to rdmavt.

Document device specific ranges in rdmavt and remap
those in hfi1.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 832369fa 02-May-2018 Brian Welty <brian.welty@intel.com>

IB/{hfi1, qib, rdmavt}: Move logic to allocate receive WQE into rdmavt

Moving receive-side WQE allocation logic into rdmavt will allow
further code reuse between qib and hfi1 drivers.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 7ebfc93e 02-Oct-2017 Sebastian Sanchez <sebastian.sanchez@intel.com>

IB/rdmavt: Correct issues with read-mostly and send size cache lines

The s_ahgpsn was incorrectly placed in the read-mostly section of the QP
and the s_curr_size and s_hdrwords are oversized. The misplaced
s_ahgpsn will cause the read-mostly cachelines to thrash.

Place s_ahgpsn in the send side cache lines and correctly size and
s_hdrwords and s_cur_size to keep the send side cachelines at the same
size.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0208da90 28-Aug-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Handle dereg of inuse MRs properly

A destroy of an MR prior to destroying the QP can cause the following
diagnostic if the QP is referencing the MR being de-registered:

hfi1 0000:05:00.0: hfi1_0: rvt_dereg_mr timeout mr ffff8808562108
00 pd ffff880859b20b00

The solution is to when the a non-zero refcount is encountered when
the MR is destroyed the QPs needs to be iterated looking for QPs in
the same PD as the MR. If rvt_qp_mr_clean() detects any such QP
references the rkey/lkey, the QP needs to be put into an error state
via a call to rvt_qp_error() which will trigger the clean up of any
stuck references.

This solution is as specified in IBTA 1.3 Volume 1 11.2.10.5.

[This is reproduced with the 0.4.9 version of qperf and the rc_bw test]

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 4734b4f4 28-Aug-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add QP iterator API for QPs

There are currently 3 spots in the qib and hfi1 driver that have
knowledge of the internal QP hash list that should only be in
scope to rdmavt QP code.

Add an iterator API for processing all QPs to hide the
nature of the RCU hashlist.

The API consists of:
- rvt_qp_iter_init()
* For iterating QPs one at a time for seq_file semantics
- rvt_qp_iter_next()
* For iterating QPs one at a time for seq_file semantics
- rvt_qp_iter()
* For iterating all QPs

The first two are used for things like seq_file prints.

The last is for code that just needs to iterate all QPs
in the system.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 16570d3d 04-Aug-2017 Sebastian Sanchez <sebastian.sanchez@intel.com>

IB/hfi1: Remove pmtu from the QP structure

The pmtu field doens't have be stored in the QP structure
as it can easily be calculated when needed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# a25ce427 17-Jun-2017 Kaike Wan <kaike.wan@intel.com>

IB/rdmavt: Setting of QP timeout can overflow jiffies computation

Current computation of qp->timeout_jiffies in rvt_modify_qp() will cause
overflow due to the fact that the input to the function usecs_to_jiffies
is only 32-bit ( unsigned int). Overflow will occur when attr->timeout is
equal to or greater than 30. The consequence is unnecessarily excessive
retry and thus degradation of the system performance.

This patch fixes the problem by limiting the input to 5-bit and calling
usecs_to_jiffies() before multiplying the scaling factor.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 7dafbab3 12-May-2017 Don Hiatt <don.hiatt@intel.com>

IB/hfi1: Add functions to parse BTH/IB headers

Improve code readablity by adding inline functions
to read specific BTH/IB fields without knowledge of
byte offsets.

Reviewed-by: Brian Welty <brian.welty@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 688f21c0 04-May-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/hfi1, IB/rdmavt: Move r_adefered to r_lock cache line

This field is causing excessive cache line bouncing.

There are spare bytes in the r_lock cache line so the best approach
is to make an rvt QP field and remove from the hfi1 priv field.

Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 90898850 29-Apr-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/core: Rename struct ib_ah_attr to rdma_ah_attr

This patch simply renames struct ib_ah_attr to
rdma_ah_attr as these fields specify attributes that are
not necessarily specific to IB.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# aad9ff97 09-Apr-2017 Michael J. Ruhl <michael.j.ruhl@intel.com>

IB/rdmavt/hfi1/qib: Use the MGID and MLID for multicast addressing

The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).

The current code only uses the MGID for identifying multicast groups.
Update the driver to be compliant with this definition.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 44dcfa4b 20-Mar-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Avoid reseting wqe send_flags in unreserve

The wqe should be read only and in fact the superfluous reset of the
RVT_SEND_RESERVE_USED flag causes an issue where reserved operations
elicit a bad completion to the ULP.

The maintenance of the flag is now entirely within rvt_post_one_wr()
where a reserved operation will set the flag and a non-reserved operation
will insure the operation that is about to be posted has the flag reset.

Fixes: Commit 856cc4c237ad ("IB/hfi1: Add the capability for reserved operations")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 43a474aa 20-Mar-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt, IB/hfi1, IB/qib: Make wc opcode translation driver dependent

The work to create a completion helper moved the translation of send
wqe operations to completion opcodes to rdmvat.

This precludes having driver dependent operations. Make the translation
driver dependent by doing the translation in the driver prior to the
rvt_qp_swqe_complete() call using restored translation tables.

Fixes: Commit f2dc9cdce83c ("IB/rdmavt: Add a send completion helper")
Fixes: Commit 0771da5a6e9d ("IB/hfi1,IB/qib: Use new send completion helper")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 832666c1 08-Feb-2017 Don Hiatt <don.hiatt@intel.com>

IB/hfi1, qib, rdmavt: Move AETH defines to rdma/ib_hdrs.h

Rename RVT AETH defines and export in rdma/ib_hdrs.h

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 881fccb8 08-Feb-2017 Don Hiatt <don.hiatt@intel.com>

IB/hfi1: Add rvt_rnr_tbl_to_usec function

Return usec from an index into ib_rvt_rnr_table.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# f9215b5e 08-Feb-2017 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt, IB/hfi1, IB/qib: Correct ack count for passive (RTR) QPs

The send complete for RC QPs mismanages the ack count when the
responder side is only in RTR.

A QP in that state cannot send requests, but it can be the target
for operations that elicit responses.

Adjust the RC completion logic to correct the count maintenance
by reflecting RECV_OK in a new state test.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0128fcea 08-Feb-2017 Brian Welty <brian.welty@intel.com>

IB/hfi1, rdmavt: Update copy_sge to use boolean arguments

Convert copy_sge and related SGE state functions to use boolean.
For determining if QP is in user mode, add helper function in rdmavt_qp.h.
This is used to determine if QP needs the last byte ordering.
While here, change rvt_pd.user to a boolean.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 11a10d4b 08-Feb-2017 Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>

IB/rdmavt: Adding timer logic to rdmavt

To move common code across target to rdmavt for code reuse.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 696513e8 08-Feb-2017 Brian Welty <brian.welty@intel.com>

IB/hfi1, qib, rdmavt: Move AETH credit functions into rdmavt

Add rvt_compute_aeth() and rvt_get_credit() as shared functions in
rdmavt, moved from hfi1/qib logic.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# beb5a042 08-Feb-2017 Brian Welty <brian.welty@intel.com>

IB/hfi1, qib, rdmavt: Move two IB event functions into rdmavt

Add rvt_rc_error() and rvt_comm_est() as shared functions in
rdmavt, moved from hfi1/qib logic.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 5dc80605 07-Dec-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt, IB/hfi1, IB/qib: Add inlines for mtu division

Add rvt_div_round_up_mtu() and rvt_div_mtu() routines to
do the computation based on the pmtu and the log_pmtu.

Change divides in qib, hfi1 to use the new inlines.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# f6475223 07-Dec-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add swqe mr deref helper

Add a helper to release mr references held by
an swqe.

Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# f2dc9cdc 07-Dec-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add a send completion helper

This is for use by client drivers to drive
send completions into a CQ.

A new exported table allows for the mapping
of ib_wr_opcode into a ib_wc_opcode.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 4107b8a0 06-Sep-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add functions to get and release QP references

This centralizes the function and improves code readability.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# fe508272 27-Jul-2016 Ira Weiny <ira.weiny@intel.com>

IB/rdmavt: Eliminate redundant opcode test in mr ref clear

The use of the specific opcode test is redundant since
all ack entry users correctly manipulate the mr pointer
to selectively trigger the reference clearing.

The overly specific test hinders the use of implementation
specific operations.

The change needs to get rid of the union to insure that
an atomic value is not seen as an MR pointer.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d9b13c20 25-Jul-2016 Jianxin Xiong <jianxin.xiong@intel.com>

IB/rdmavt, hfi1: Fix NFSoRDMA failure with FRMR enabled

Hanging has been observed while writing a file over NFSoRDMA. Dmesg on
the server contains messages like these:

[ 931.992501] svcrdma: Error -22 posting RDMA_READ
[ 952.076879] svcrdma: Error -22 posting RDMA_READ
[ 982.154127] svcrdma: Error -22 posting RDMA_READ
[ 1012.235884] svcrdma: Error -22 posting RDMA_READ
[ 1042.319194] svcrdma: Error -22 posting RDMA_READ

Here is why:

With the base memory management extension enabled, FRMR is used instead
of FMR. The xprtrdma server issues each RDMA read request as the following
bundle:

(1)IB_WR_REG_MR, signaled;
(2)IB_WR_RDMA_READ, signaled;
(3)IB_WR_LOCAL_INV, signaled & fencing.

These requests are signaled. In order to generate completion, the fast
register work request is processed by the hfi1 send engine after being
posted to the work queue, and the corresponding lkey is not valid until
the request is processed. However, the rdmavt driver validates lkey when
the RDMA read request is posted and thus it fails immediately with error
-EINVAL (-22).

This patch changes the work flow of local operations (fast register and
local invalidate) so that fast register work requests are always
processed immediately to ensure that the corresponding lkey is valid
when subsequent work requests are posted. Local invalidate requests are
processed immediately if fencing is not required and no previous local
invalidate request is pending.

To allow completion generation for signaled local operations that have
been processed before posting to the work queue, an internal send flag
RVT_SEND_COMPLETION_ONLY is added. The hfi1 send engine checks this flag
and only generates completion for such requests.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 856cc4c2 25-Jul-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/hfi1: Add the capability for reserved operations

This fix allows for support of in-kernel reserved operations
without impacting the ULP user.

The low level driver can register a non-zero value which
will be transparently added to the send queue size and hidden
from the ULP in every respect.

ULP post sends will never see a full queue due to a reserved
post send and reserved operations will never exceed that
registered value.

The s_avail will continue to track the ULP swqe availability
and the difference between the reserved value and the reserved
in use will track reserved availabity.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d9f87239 25-Jul-2016 Jianxin Xiong <jianxin.xiong@intel.com>

IB/rdmavt: Handle local operations in post send

Some work requests are local operations, such as IB_WR_REG_MR and
IB_WR_LOCAL_INV. They differ from non-local operations in that:

(1) Local operations can be processed immediately without being posted
to the send queue if neither fencing nor completion generation is needed.
However, to ensure correct ordering, once a local operation is posted to
the work queue due to fencing or completion requiement, all subsequent
local operations must also be posted to the work queue until all the
local operations on the work queue have completed.

(2) Local operations don't send packets over the wire and thus don't
need (and shouldn't update) the packet sequence numbers.

Define a new a flag bit for the post send table to identify local
operations.

Add a new field to the QP structure to track the number of local
operations on the send queue to determine if direct processing of new
local operations should be enabled/disabled.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# afcf8f76 01-Jul-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Add data structures and routines for table driven post send

Add flexibility for driver dependent operations in post send
because different drivers will have differing post send
operation support.

This includes data structure definitions to support a table
driven scheme along with the necessary validation routine
using the new table.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 8b103e9c 24-May-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdamvt: Fix rdmavt s_ack_queue sizing

rdmavt allows the driver to specify the size of the ack queue, but
only uses it for the modify QP limit testing for setting the atomic
limit value.

The driver dependent size is now used to size the s_ack_queue ring
dynamicially.

Since the driver knows its size, the driver will use its define
for any ring size dependent code.

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# f39cc34d 12-Apr-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: Fix adaptive pio hang

The RVT_S_WAIT_PIO_DRAIN flag was missing from
the set of flags indicating a qp is waiting
on a resource.

This caused the sleep/wakeup for adaptive pio
drain to lose a wakeup "hanging" a QP.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# ef086c0d 07-Mar-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/hfi1: Report pid in qp_stats to aid debug

Tracking user/QP ownership is needed to debug issues with
user ULPs like OpenMPI.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 14553ca1 14-Feb-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

staging/rdma/hfi1: Adaptive PIO for short messages

The change requires a new pio_busy field in the iowait structure to
track the number of outstanding pios. The new counter together
with the sdma counter serve as the basis for a packet by packet decision
as to which egress mechanism to use. Since packets given to different
egress mechanisms are not ordered, this scheme will preserve the order.

The iowait drain/wait mechanisms are extended for a pio case. An
additional qp wait flag is added for the PIO drain wait case.

Currently the only pio wait is for buffers, so the no_bufs_available()
routine name is changed to pio_wait() and a third argument is passed
with one of the two pio wait flags to generalize the routine. A module
parameter is added to hold a configurable threshold. For now, the
module parameter is zero.

A heuristic routine is added to return the func pointer of the proper
egress routine to use.

The heuristic is as follows:
- SMI always uses pio
- GSI,UD qps <= threshold use pio
- UD qps > threadhold use sdma
o No coordination with sdma is required because order is not required
and this qp pio count is not maintained for UD
- RC/UC ONLY packets <= threshold chose as follows:
o If sdmas pending, use SDMA
o Otherwise use pio and enable the pio tracking count at
the time the pio buffer is allocated
- RC/UC ONLY packets > threshold use SDMA
o If pio's are pending the pio_wait with the new wait flag is
called to delay for pios to drain

The threshold is potentially reduced by the QP's mtu.

The sc_buffer_alloc() has two additional args (a callback, a void *)
which are exploited by the RC/UC cases to pass a new complete routine
and a qp *.

When the shadow ring completes the credit associated with a packet,
the new complete routine is called. The verbs_pio_complete() will then
decrement the busy count and trigger any drain waiters in qp destroy
or reset.

Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d2421a82 14-Feb-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmvt: close send engine struct holes

pahole noted the wasted 4 bytes after s_lock and r_lock.

Move s_flags and r_psn to fill the holes.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 79a225be 14-Feb-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Remove unnecessary exported functions

Remove exported functions which are no longer required as the
functionality has moved into rdmavt. This also requires re-ordering some
of the functions since their prototype no longer appears in a header
file. Rather than add forward declarations it is just cleaner to
re-order some of the functions.

Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 46a80d62 14-Feb-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/qib, staging/rdma/hfi1: add s_hlock for use in post send

This patch adds an additional lock to reduce contention on the s_lock.

This lock is used in post_send() so that the post_send is not
serialized with the send engine and other send related processing.

To do this the s_next_psn is now maintained on post_send() while
post_send() related fields are moved to a new cache line. There is
an s_avail maintained for the post_send() to mitigate trading cache
lines with the send engine. The lock is released/acquired around
releasing the just built packet to the egress mechanism.

Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# bfee5e32 09-Feb-2016 Vennila Megavannan <vennila.megavannan@intel.com>

IB/rdmavt, staging/rdma/hfi1: use qps to dynamically scale timeout value

A busy_jiffies variable is maintained and updated when rc qps are
created and deleted. busy_jiffies is a scaled value of the number
of rc qps in the device. busy_jiffies is incremented every rc qp
scaling interval. busy_jiffies is added to the rc timeout
in add_retry_timer and mod_retry_timer. The rc qp scaling interval
is selected based on extensive performance evaluation of targeted
workloads.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Vennila Megavannan <vennila.megavannan@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 066fad28 04-Feb-2016 Mike Marciniszyn <mike.marciniszyn@intel.com>

IB/rdmavt: remove unused qp field

The field is a vestige from ipath.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# fe314195 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Fix copyright date

Update all files added by rdmavt which do not yet have 2016 as the
copyright year.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 2b047ea7 22-Jan-2016 Ira Weiny <ira.weiny@intel.com>

IB/rdmavt: Remove unused variable from Queue Pair

s_sde should be in the low level driver QP private data.

Remove the definition from rvt_qp.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 4e74080b 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Add multicast functions

This patch adds in the multicast add and remove functions as well as the
ancillary infrastructure needed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 3b0b3fb3 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Add modify qp

Add modify qp and supporting functions.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# bfbac097 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Add post send to rdmavt

Add in a post_send and post_one_send to rdmavt. The ULP will provide a WQE
to rdmavt which will then walk and queue each element. Rdmavt will then
queue the work to be done in the driver or kick the driver's progress
routine.

There needs to be a follow on patch which adds in another lock for the
head of the queue so that it can be added to and read from in parallel.
This will touch protocol handlers and require other changes in the
drivers. This will be done separately.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 5a9cf6f2 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Export reset_qp in rdmavt

Until all queue pair functionality is moved to rdmavt we need to provide
access to the reset function. This is only temporary and will be reverted
back to a static, non exported function in the end.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 050eb7fb 22-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Add R and S flags for queue pairs

Use the flags originally provided for hfi1 in the rdmavt driver. These will
be made available to drivers in the qp header file.

Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0acb0cc7 06-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Initialize and teardown of qpn table

Add table init as well as teardown for handling qpn maps. Drivers can still
provide this functionality by setting the QP_INIT_DRIVER bit.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# b4e64397 06-Jan-2016 Dennis Dalessandro <dennis.dalessandro@intel.com>

IB/rdmavt: Break rdma_vt main include header file up

Until all functionality is moved over to rdmavt drivers still need to
access a number of fields in data structures that are predominantly
meant to be used by rdmavt. Once these rdmavt_<ibta_object>.h header
files are no longer being touched by drivers their content should be
moved to rdmavt/<ibta_object>.h. While here move a couple #defines
over to more general IB verbs header files because they fit better.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>