History log of /linux-master/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
Revision Date Author Comments
# 64854534 11-Jan-2024 Christian Heusel <christian@heusel.eu>

RDMA/ipoib: Print symbolic error name instead of error code

Use the %pe print specifier to print the symbolic error name as a
string (e.g. "-ENOMEM") in the log message instead of the numeric error
code, which makes the message easier to read.

This change was suggested in
https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/
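As a rough illustration of the effect, the sketch below mimics in userspace what a symbolic error name looks like in a log line. The helper `err_symbol()`, the function `format_join_error()` and the message text are hypothetical names made up for this example; in the kernel the specifier is used directly, along the lines of `pr_err("...: %pe\n", ERR_PTR(ret))`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the text that %pe produces.
 * Only a few error codes are covered; this is not the kernel's table.
 */
static const char *err_symbol(int err)
{
    switch (err) {
    case -ENOMEM: return "-ENOMEM";
    case -EINVAL: return "-EINVAL";
    case -EAGAIN: return "-EAGAIN";
    default:      return "(unknown)";
    }
}

/* Format an (invented) log message with the symbolic name of `err`. */
static int format_join_error(char *buf, size_t len, int err)
{
    return snprintf(buf, len, "multicast join failed, status %s",
                    err_symbol(err));
}
```

A reader of the log sees "-ENOMEM" rather than "-12" and does not have to look the number up.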

Signed-off-by: Christian Heusel <christian@heusel.eu>
Link: https://lore.kernel.org/r/20240111141311.987098-1-christian@heusel.eu
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 4f973e21 12-Dec-2023 Daniel Vacek <neelx@redhat.com>

IB/ipoib: Fix mcast list locking

Releasing the `priv->lock` while iterating the `priv->multicast_list` in
`ipoib_mcast_join_task()` opens a window for `ipoib_mcast_dev_flush()` to
remove the items while in the middle of iteration. If the mcast is removed
while the lock was dropped, the for loop spins forever resulting in a hard
lockup (as was reported on RHEL 4.18.0-372.75.1.el8_6 kernel):

Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
-----------------------------------+-----------------------------------
ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
  spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
  list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
      &priv->multicast_list, list) |
    ipoib_mcast_join(dev, mcast)   |
      spin_unlock_irq(&priv->lock) |
                                   |   spin_lock_irqsave(&priv->lock, flags)
                                   |   list_for_each_entry_safe(mcast, tmcast,
                                   |                  &priv->multicast_list, list)
                                   |     list_del(&mcast->list);
                                   |     list_add_tail(&mcast->list, &remove_list)
                                   |   spin_unlock_irqrestore(&priv->lock, flags)
      spin_lock_irq(&priv->lock)   |
                                   |   ipoib_mcast_remove_list(&remove_list)
   (Here, `mcast` is no longer on the  |    list_for_each_entry_safe(mcast, tmcast,
    `priv->multicast_list` and we keep |                            remove_list, list)
    spinning on the `remove_list` of   |  >>>  wait_for_completion(&mcast->done)
    the other thread which is blocked  |
    and the list is still valid on     |
    its stack.)

Fix this by keeping the lock held and changing to GFP_ATOMIC to prevent
eventual sleeps.
Unfortunately we could not reproduce the lockup and confirm this fix but
based on the code review I think this fix should address such lockups.
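The endless iteration described above can be reproduced in a small userspace model. The sketch below is not the driver code: it uses a minimal circular list modelled on the kernel's `list_head`, and the helpers `walk_reaches_head()` and `stale_iterator_terminates()` are hypothetical names introduced only for this illustration. The walk is bounded so the demonstration itself cannot hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/*
 * Continue a walk from `pos` and report whether it gets back to `head`
 * within `max_steps`. The kernel loop has no such bound, so a walk that
 * never sees `head` spins forever.
 */
static bool walk_reaches_head(const struct list_head *pos,
                              const struct list_head *head, int max_steps)
{
    for (int i = 0; i < max_steps; i++) {
        pos = pos->next;
        if (pos == head)
            return true;
    }
    return false;
}

/* Returns true if the cursor held by "Task A" can still terminate. */
static bool stale_iterator_terminates(void)
{
    struct list_head multicast_list, remove_list, a, b;

    list_init(&multicast_list);
    list_init(&remove_list);
    list_add_tail(&a, &multicast_list);
    list_add_tail(&b, &multicast_list);

    /* Task A holds `a` as its cursor and drops the lock here. */
    struct list_head *cursor = &a;

    /* Task B moves every entry to its private remove_list. */
    list_del(&a); list_add_tail(&a, &remove_list);
    list_del(&b); list_add_tail(&b, &remove_list);

    /* Task A resumes: its cursor now lives on remove_list. */
    return walk_reaches_head(cursor, &multicast_list, 1000);
}
```

In this model the cursor circulates through `a`, `b` and the head of `remove_list` and never encounters `multicast_list`, which is the loop termination condition. Keeping the lock held across the iteration removes the window in which the entries can be moved.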

crash> bc 31
PID: 747 TASK: ff1c6a1a007e8000 CPU: 31 COMMAND: "kworker/u72:2"
--
[exception RIP: ipoib_mcast_join_task+0x1b1]
RIP: ffffffffc0944ac1 RSP: ff646f199a8c7e00 RFLAGS: 00000002
RAX: 0000000000000000 RBX: ff1c6a1a04dc82f8 RCX: 0000000000000000
work (&priv->mcast_task{,.work})
RDX: ff1c6a192d60ac68 RSI: 0000000000000286 RDI: ff1c6a1a04dc8000
&mcast->list
RBP: ff646f199a8c7e90 R8: ff1c699980019420 R9: ff1c6a1920c9a000
R10: ff646f199a8c7e00 R11: ff1c6a191a7d9800 R12: ff1c6a192d60ac00
mcast
R13: ff1c6a1d82200000 R14: ff1c6a1a04dc8000 R15: ff1c6a1a04dc82d8
dev priv (&priv->lock) &priv->multicast_list (aka head)
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ff646f199a8c7e00] ipoib_mcast_join_task+0x1b1 at ffffffffc0944ac1 [ib_ipoib]
#6 [ff646f199a8c7e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f199a8c7e68
ff646f199a8c7e68: ff1c6a1a04dc82f8 <<< work = &priv->mcast_task.work

crash> list -hO ipoib_dev_priv.multicast_list ff1c6a1a04dc8000
(empty)

crash> ipoib_dev_priv.mcast_task.work.func,mcast_mutex.owner.counter ff1c6a1a04dc8000
mcast_task.work.func = 0xffffffffc0944910 <ipoib_mcast_join_task>,
mcast_mutex.owner.counter = 0xff1c69998efec000

crash> b 8
PID: 8 TASK: ff1c69998efec000 CPU: 33 COMMAND: "kworker/u72:0"
--
#3 [ff646f1980153d50] wait_for_completion+0x96 at ffffffff9c7d7646
#4 [ff646f1980153d90] ipoib_mcast_remove_list+0x56 at ffffffffc0944dc6 [ib_ipoib]
#5 [ff646f1980153de8] ipoib_mcast_dev_flush+0x1a7 at ffffffffc09455a7 [ib_ipoib]
#6 [ff646f1980153e58] __ipoib_ib_dev_flush+0x1a4 at ffffffffc09431a4 [ib_ipoib]
#7 [ff646f1980153e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f1980153e68
ff646f1980153e68: ff1c6a1a04dc83f0 <<< work = &priv->flush_light

crash> ipoib_dev_priv.flush_light.func,broadcast ff1c6a1a04dc8000
flush_light.func = 0xffffffffc0943820 <ipoib_ib_dev_flush_light>,
broadcast = 0x0,

The mcast(s) on the `remove_list` (the remaining part of the ex `priv->multicast_list`):

crash> list -s ipoib_mcast.done.done ipoib_mcast.list -H ff646f1980153e10 | paste - -
ff1c6a192bd0c200 done.done = 0x0,
ff1c6a192d60ac00 done.done = 0x0,

Reported-by: Yuya Fujita-bishamonten <fj-lsoft-rh-driver@dl.jp.fujitsu.com>
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Link: https://lore.kernel.org/all/20231212080746.1528802-1-neelx@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 753fff78 21-Nov-2023 Jack Wang <jinpu.wang@ionos.com>

RDMA/IPoIB: Fix error code return in ipoib_mcast_join

Return the error code in case of ib_sa_join_multicast fail.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231121130316.126364-2-jinpu.wang@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 633d6102 28-Jan-2021 Christoph Lameter <cl@linux.com>

RDMA/ipoib: Remove racy Subnet Manager sendonly join checks

When a system receives a REREG event from the SM, then the SM information
in the kernel is marked as invalid and a request is sent to the SM to
update the information. The SM information is invalid in that time period.

However, receiving a REREG also occurs simultaneously in user space
applications that are now trying to rejoin the multicast groups. Some of
those may be sendonly multicast groups which are then failing.

If the SM information is invalid then ib_sa_sendonly_fullmem_support()
returns false. That is wrong because it just means that we do not know yet
if the potentially new SM supports sendonly joins.

Sendonly join was introduced in 2015 and all the Subnet managers have
supported it ever since. So there is no point in checking if a subnet
manager supports it.

Should an old opensm get a request for a sendonly join then the request
will fail. The code that is removed here accommodated that situation and
fell back to a full join.

Falling back to a full join is problematic in itself. The reason to use
the sendonly join was to reduce the traffic on the Infiniband fabric;
otherwise one could have just stayed with the regular join. So this patch
may cause users of very old opensms to discover that lots of traffic
needlessly crosses their IB fabrics.

Link: https://lore.kernel.org/r/alpine.DEB.2.22.394.2101281845160.13303@www.lameter.com
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 90eef9f7 14-Dec-2020 Zheng Yongjun <zhengyongjun3@huawei.com>

RDMA: Convert comma to semicolon

Replace a comma between expression statements by a semicolon.
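A small sketch of why the distinction matters. In straight-line code a comma and a semicolon between two assignments behave the same, which is why this kind of change is a pure cleanup. The difference only becomes visible under an unbraced `if`, as the hypothetical functions below show (they are not taken from the driver).

```c
#include <assert.h>

/*
 * With a comma, both assignments form ONE expression statement,
 * so an unbraced `if` guards both of them.
 */
static int with_comma(int cond)
{
    int a = 0, b = 0;

    if (cond)
        a = 1, b = 2;
    return a + b;
}

/* With a semicolon, only the first assignment belongs to the `if`. */
static int with_semicolon(int cond)
{
    int a = 0, b = 0;

    if (cond)
        a = 1;
    b = 2;
    return a + b;
}
```

With `cond == 0` the first function returns 0 and the second returns 2, so an accidental comma can silently change control flow if the surrounding code is edited later.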

Link: https://lore.kernel.org/r/20201214134118.4349-1-zhengyongjun3@huawei.com
Link: https://lore.kernel.org/r/20201214134146.4456-1-zhengyongjun3@huawei.com
Link: https://lore.kernel.org/r/20201214134218.4510-1-zhengyongjun3@huawei.com
Link: https://lore.kernel.org/r/20201214134243.4563-1-zhengyongjun3@huawei.com
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>


# 90cdff90 21-Jun-2020 Kamal Heib <kamalheib1@gmail.com>

RDMA/ipoib: Return void from ipoib_mcast_stop_thread()

The return value from ipoib_mcast_stop_thread() is always 0 - change it to
be void.

Link: https://lore.kernel.org/r/20200622092256.6931-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# ebd6e96b 25-May-2020 Kamal Heib <kamalheib1@gmail.com>

RDMA/ipoib: Remove can_sleep parameter from ipoib_mcast_alloc

can_sleep is always 0 when ipoib_mcast_alloc() is called, so remove it and
use GFP_ATOMIC instead of GFP_KERNEL.

Link: https://lore.kernel.org/r/20200525130305.171509-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# b7e159eb 10-May-2020 Gary Leshner <Gary.S.Leshner@intel.com>

IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size

When in connected mode ipoib sent broadcast pings which exceeded the mtu
size for broadcast addresses.

Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
mcast mtu size.

The RDMA netdev uses this value to determine if the skb length is too long
for the mtu specified and if it is, drops the packet and logs an error
about the errant packet.

Link: https://lore.kernel.org/r/20200511160655.173205.14546.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 6d72344c 10-May-2020 Kaike Wan <kaike.wan@intel.com>

IB/ipoib: Increase ipoib Datagram mode MTU's upper limit

Currently the ipoib UD mtu is restricted to 4K bytes. Remove this
limitation so that the IPOIB module can potentially use an MTU (in UD
mode) that is bounded by the MTU of the underlying device. A field is
added to the ib_port_attr structure to indicate the maximum physical
MTU the underlying device supports.

Link: https://lore.kernel.org/r/20200511160618.173205.23053.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 75da9606 29-Jul-2018 Denis Drozdov <denisd@mellanox.com>

IB/IPoIB: Set ah valid flag in multicast send flow

Adding the "valid" flag to the ipoib_ah data structure, together with the
checks of ah->valid in ipoib_start_xmit, affected the multicast packet flow.

Since the multicast flow doesn't invoke path_rec_start, the "ah->valid" flag
remains unset, so that ipoib_start_xmit ends up in neigh_refresh_path
instead of sending the packet using neigh.

"ah->valid" has to be set in the multicast send flow. As a result IPoIB
starts sending packets via neigh immediately and eliminates the 60sec delay
of the neigh keep alive interval.

A typical example of this issue is two sequential arpings:

arping 11.134.208.9 -> got response (mcast_send)
arping 11.134.208.9 -> no response (ah->valid = 0)

Fixes: fa9391dbad4b ("RDMA/ipoib: Update paths on CLIENT_REREG/SM_CHANGE events")
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# b1b63970 04-Jul-2018 Kamal Heib <kamalheib1@gmail.com>

RDMA/ipoib: Fix use of sizeof()

Make sure to use sizeof(...) instead of sizeof ..., which is the
preferred style.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 112f5c81 16-May-2018 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

IB/ipoib: replace local_irq_disable() with proper locking

In ipoib_mcast_restart_task() the netif_addr_lock() is invoked prior to
local_irq_save(). netif_addr_lock() should not be invoked in an interrupt
disabled section, only in BH disabled sections.
The priv->lock is always acquired with disabled interrupts. The only place
where netif_addr_lock() and priv->lock nest is ipoib_mcast_restart_task().

Drop the local_irq_save() and acquire priv->lock with spin_lock_irq() inside
the netif_addr locked section. It's safe to do so because the caller is either
a worker function or __ipoib_ib_dev_flush() which are both calling with
interrupts enabled (and since BH is enabled here too,
netif_addr_lock_bh() needs to be used).

Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# 16ba3def 31-Dec-2017 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Fix race condition in neigh creation

When using enhanced mode for IPoIB, two threads may execute xmit in
parallel to two different TX queues while the target is the same.
In this case, both of them will add the same neighbor to the path's
neigh link list and we might see the following message:

list_add double add: new=ffff88024767a348, prev=ffff88024767a348...
WARNING: lib/list_debug.c:31__list_add_valid+0x4e/0x70
ipoib_start_xmit+0x477/0x680 [ib_ipoib]
dev_hard_start_xmit+0xb9/0x3e0
sch_direct_xmit+0xf9/0x250
__qdisc_run+0x176/0x5d0
__dev_queue_xmit+0x1f5/0xb10
__dev_queue_xmit+0x55/0xb10

Analysis:
Two SKBs are scheduled to be transmitted from two cores.
In ipoib_start_xmit, both get NULL when calling ipoib_neigh_get.
Two calls to neigh_add_path are made. One thread takes the spin-lock
and calls ipoib_neigh_alloc which creates the neigh structure,
then (after the __path_find) the neigh is added to the path's neigh
link list. When the second thread enters the critical section it also
calls ipoib_neigh_alloc but in this case it gets the already allocated
ipoib_neigh structure, which is already linked to the path's neigh
link list and adds it again to the list. Besides triggering the
list debug warning, this creates a loop in the linked list. This loop
leads to an endless loop inside path_rec_completion.

Solution:
Check list_empty(&neigh->list) before adding to the list.
Add a similar fix in "ipoib_multicast.c::ipoib_mcast_send"
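The effect of the double add, and of the list_empty() guard, can be seen in a small userspace model. This is a sketch under assumptions: the list is a minimal imitation of the kernel's `list_head`, and `add_once()`, `list_count()` and the two driver functions at the bottom are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Guarded add: link the entry only if it is not on a list yet. */
static void add_once(struct list_head *entry, struct list_head *head)
{
    if (list_empty(entry))
        list_add_tail(entry, head);
}

/* Count entries, giving up after `limit` steps so a corrupted list cannot hang. */
static int list_count(const struct list_head *head, int limit)
{
    int n = 0;

    for (const struct list_head *p = head->next; p != head && n < limit; p = p->next)
        n++;
    return n;
}

/* Two unguarded adds of the same entry: the entry ends up pointing at itself. */
static int double_add_unguarded(void)
{
    struct list_head head, neigh;

    list_init(&head);
    list_init(&neigh);
    list_add_tail(&neigh, &head);
    list_add_tail(&neigh, &head);
    return list_count(&head, 100);
}

/* Two guarded adds of the same entry: the second one is a no-op. */
static int double_add_guarded(void)
{
    struct list_head head, neigh;

    list_init(&head);
    list_init(&neigh);
    add_once(&neigh, &head);
    add_once(&neigh, &head);
    return list_count(&head, 100);
}
```

In the unguarded case the count runs into the step limit because `neigh.next` points back at `neigh`, which corresponds to the endless loop described in the analysis. The guard relies on the entry having been initialised as an empty list before its first use.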

Fixes: b63b70d87741 ('IPoIB: Use a private hash table for path lookup in xmit path')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>


# a08e1120 12-Jul-2017 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Make sure no in-flight joins while leaving that mcast

When cleaning up neighs, if there is a send-only mcast neigh, the driver
should wait for its join process to finish before trying to remove it.

Without this patch, we will see messages like: "ipoib_mcast_leave on an
in-flight join" and unexpected results in the join_complete.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 6bdc8de2 12-Jul-2017 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Use cancel_delayed_work_sync when needed

The mcast_task work can re-queue itself, so instead of doing
cancel && flush_workqueue, which can still leave a queued task
in flight, use cancel_delayed_work_sync.

Also, there is no need to hold the lock over the cancel; the original lock
was due to the bit assignment (IPOIB_MCAST_RUN), which is not in use
anymore.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# edf3f301 10-Jul-2017 Feras Daoud <ferasda@mellanox.com>

IB/ipoib: Fix race between light events and interface restart

A potential race between light_event and interface restart
may attach multicast group to an already attached QP.

Scenario:
light_event flow goes through ipoib_mcast_dev_flush function,
if a context switch occurs before calling ipoib_mcast_remove_list,
then we may face a situation where the broadcast of the priv is null
and the corresponding QP is not detached yet.
If an "interface restart" runs during the previous context switch,
the following scenario occurs:
When the device goes up, ipoib_ib_dev_up function will be called,
it will send a new registration request to the broadcast group and then
attach the group to the QP that was not detached before.

IPOIB_FLUSH_LIGHT                                INTERFACE RESTART

__ipoib_ib_dev_flush                                     |
        |                                                |
        |                                                |
        |                                                |
ipoib_mcast_dev_flush                                    |
Move mcast list and broadcast to remove_list             |
        |                                                |
        |                                                |
Context Switch-->                                        |
        |                                        ipoib_ib_dev_down
        |                                                |
        |                                                |
        |                                        ipoib_ib_dev_up
        |                                                |
        |                                                |
        |                                        ipoib_mcast_join_task
        |                                        allocate new broadcast
        |                                                |
        |                                                |
        |                                        Attach QP to multicast group
        |                                                |
        |                                                |
        |                                        <--Context Switch
ipoib_mcast_leave
Detach QP from multicast group

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 44c58487 29-Apr-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/core: Define 'ib' and 'roce' rdma_ah_attr types

rdma_ah_attr can now be either ib or roce allowing
core components to use one type or the other and also
to define attributes unique to a specific type. struct
ib_ah is also initialized with the type when it is first
created. This ensures that calls such as modify_ah
don't modify the type of the address handle attribute.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d8966fcd 29-Apr-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/core: Use rdma_ah_attr accessor functions

Modify core and driver components to use accessor functions
introduced to access individual fields of rdma_ah_attr

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 90898850 29-Apr-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/core: Rename struct ib_ah_attr to rdma_ah_attr

This patch simply renames struct ib_ah_attr to
rdma_ah_attr as these fields specify attributes that are
not necessarily specific to IB.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# cfd51935 29-Apr-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/IPoIB: Remove 'else' when the 'if' has a return.

This patch fixes a checkpatch issue related to not having
to use an 'else' if the 'if' path returns from the function.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# ee1c60b1 20-Mar-2017 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

IB/SA: Modify SA to implicitly cache Class Port info

SA will query and cache class port info as part of
its initialization. SA will also invalidate and
refresh the cache based on specific events. Callers such
as IPoIB and CM can query the SA to get the classportinfo
information. Apart from making the caller code much simpler,
this change puts the onus on the SA to query and maintain
classportinfo much like how it maintains the address handle to the SM.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 3e31a490 19-Mar-2017 Feras Daoud <ferasda@mellanox.com>

IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow

Before calling ipoib_stop, rtnl_lock should be taken, then
the flow clears the IPOIB_FLAG_ADMIN_UP and IPOIB_FLAG_OPER_UP
flags, and waits for mcast completion if IPOIB_MCAST_FLAG_BUSY
is set.

On the other hand, the flow of multicast join task initializes
a mcast completion, sets the IPOIB_MCAST_FLAG_BUSY and calls
ipoib_mcast_join. If IPOIB_FLAG_OPER_UP flag is not set, this
call returns EINVAL without setting the mcast completion and
leads to a deadlock.

ipoib_stop                              |
    |                                   |
clear_bit(IPOIB_FLAG_ADMIN_UP)          |
    |                                   |
Context Switch                          |
    |                           ipoib_mcast_join_task
    |                                   |
    |                           spin_lock_irq(lock)
    |                                   |
    |                           init_completion(mcast)
    |                                   |
    |                           set_bit(IPOIB_MCAST_FLAG_BUSY)
    |                                   |
    |                           Context Switch
    |                                   |
clear_bit(IPOIB_FLAG_OPER_UP)           |
    |                                   |
spin_lock_irqsave(lock)                 |
    |                                   |
Context Switch                          |
    |                           ipoib_mcast_join
    |                           return (-EINVAL)
    |                                   |
    |                           spin_unlock_irq(lock)
    |                                   |
    |                           Context Switch
    |                                   |
ipoib_mcast_dev_flush                   |
wait_for_completion(mcast)              |

ipoib_stop will wait for the mcast completion forever, and will
not release the rtnl_lock. As a result a panic occurs with the
following trace:

[13441.639268] Call Trace:
[13441.640150] [<ffffffff8168b579>] schedule+0x29/0x70
[13441.641038] [<ffffffff81688fc9>] schedule_timeout+0x239/0x2d0
[13441.641914] [<ffffffff810bc017>] ? complete+0x47/0x50
[13441.642765] [<ffffffff810a690d>] ? flush_workqueue_prep_pwqs+0x16d/0x200
[13441.643580] [<ffffffff8168b956>] wait_for_completion+0x116/0x170
[13441.644434] [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
[13441.645293] [<ffffffffa05af170>] ipoib_mcast_dev_flush+0x150/0x190 [ib_ipoib]
[13441.646159] [<ffffffffa05ac967>] ipoib_ib_dev_down+0x37/0x60 [ib_ipoib]
[13441.647013] [<ffffffffa05a4805>] ipoib_stop+0x75/0x150 [ib_ipoib]
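The ordering problem behind this deadlock can be modelled without threads. The sketch below is a hypothetical, single-threaded model: `struct mcast_model`, `join_racy()`, `join_checked()` and `flush_would_block()` are names invented for this illustration, and `join_checked()` shows one possible ordering that avoids the hang. It is not claimed to be the exact change made by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-group state involved in the deadlock. */
struct mcast_model {
    bool busy;       /* stands in for IPOIB_MCAST_FLAG_BUSY */
    bool completed;  /* stands in for the mcast completion having fired */
};

/* Ordering from the scenario above: mark the group busy, then fail. */
static int join_racy(struct mcast_model *m, bool oper_up)
{
    m->completed = false;   /* init_completion() */
    m->busy = true;         /* set_bit(IPOIB_MCAST_FLAG_BUSY) */
    if (!oper_up)
        return -EINVAL;     /* nobody will ever complete `m` */
    return 0;
}

/* One possible ordering that avoids it: check the port state first. */
static int join_checked(struct mcast_model *m, bool oper_up)
{
    if (!oper_up)
        return -EINVAL;     /* `m` is left untouched, so it is not busy */
    m->completed = false;
    m->busy = true;
    return 0;
}

/* The flush path waits only on groups that are busy and not yet completed. */
static bool flush_would_block(const struct mcast_model *m)
{
    return m->busy && !m->completed;
}
```

After a failed `join_racy()` the group is busy with no completion pending, so a flush that waits on it never wakes up. After a failed `join_checked()` the flush has nothing to wait for.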

Fixes: 08bc327629cb ("IB/ipoib: fix for rare multicast join race condition")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# cd565b4b 10-Apr-2017 Erez Shitrit <erezsh@mellanox.com>

IB/IPoIB: Support acceleration options callbacks

IPoIB driver now uses the new set of callback functions.

If the hardware provider supports the new ipoib_options implementation,
the driver uses the callbacks in its data path flows, otherwise it uses the
driver default implementation for all data flows in its code.

The default implementation wasn't changed and it is exactly as it was before
the introduction of acceleration support.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# c1048aff 10-Apr-2017 Erez Shitrit <erezsh@mellanox.com>

IB/IPoIB: Use defined function for netdev_priv function

Make ipoib_priv point to netdev_priv where the code calls netdev_priv.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 5c37077f 18-Jan-2017 Zhu Yanjun <yanjun.zhu@oracle.com>

IB/ipoib: Remove the unnecessary error check

The function ipoib_mcast_start_thread/ipoib_ib_dev_up always return zero.
As such, in the function ipoib_open, err_stop will never be reached.
So remove this err_stop and change the return type of the function
ipoib_mcast_start_thread/ipoib_ib_dev_up to void.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d32b9a81 28-Dec-2016 Feras Daoud <ferasda@mellanox.com>

IB/ipoib: Add detailed error message to dev_queue_xmit call

Log the detailed return code of the dev_queue_xmit function when
requeueing a packet obtained via __skb_dequeue.

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 11b642b8 21-Nov-2016 Bart Van Assche <bvanassche@acm.org>

IPoIB: Avoid reading an uninitialized member variable

This patch prevents Coverity from reporting the following:

Using uninitialized value port_attr.state when calling printk

Fixes: commit 94232d9ce817 ("IPoIB: Start multicast join process only on active ports")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# fc791b63 13-Oct-2016 Paolo Abeni <pabeni@redhat.com>

IB/ipoib: move back IB LL address into the hard header

After the commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroys the IPoIB address information cached there,
causing a severe performance regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control block into the IPoIB hard header, as done before
the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issues, on packet reception, the IPoIB driver
stashes a dummy pseudo header into the skb, so that the received
packets actually have a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.

After this commit, IPoIB performance is back to its pre-regression
value.

v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9b29953b 04-Jun-2016 Mark Bloch <markb@mellanox.com>

IB/IPoIB: Disable bottom half when dealing with device address

Align locking usage when touching device address with rest
of the kernel. Lock the bottom half when doing so using
netif_addr_lock_bh.

This also solves the following case as reported by lockdep:
       CPU0                    CPU1
       ----                    ----
  lock(_xmit_INFINIBAND);
                               local_irq_disable();
                               lock(&(&mc->mca_lock)->rlock);
                               lock(_xmit_INFINIBAND);
  <Interrupt>
    lock(&(&mc->mca_lock)->rlock);

*** DEADLOCK ***

Fixes: 492a7e67ff83 ("IB/IPoIB: Allow setting the device address")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 492a7e67 18-May-2016 Mark Bloch <markb@mellanox.com>

IB/IPoIB: Allow setting the device address

In IB networks, and specifically in IPoIB/rdmacm traffic, the device
address of an IPoIB interface is used as a means to exchange information
between nodes needed for communication.

Currently an IPoIB interface will always be created with a device
address based on its node GUID without a way to change that.

This change adds the ability to set the device address of an IPoIB
interface by value. We use the set mac address ndo to do that.

The flow should be broken down to two:
1) The GID value is already in the GID table,
in this case the interface will be able to set carrier up.

2) The GID value is not yet in the GID table,
in this case the interface won't try to join the multicast group
and will wait (listen on GID_CHANGE event) until the GID is inserted.

In order to track those changes, we add a new flag:
* IPOIB_FLAG_DEV_ADDR_SET.

When set, it means the dev_addr is based on a value in the gid
table. This bit will be cleared upon a dev_addr change triggered
by the user and set after validation.

Per IB spec the port GUID can't change if the module is loaded.
The port GUID is the basis for GID at index 0 which is the basis for
the default device address of an ipoib interface.

The issue is that there are devices that don't follow the spec:
they change the port GUID while the HCA is powered on. So in order
not to break userspace applications, we need to check if the
user wanted to control the device address, and we assume that
if he sets the device address back to be based on GID index 0,
he no longer wishes to control it.

In order to track this, we add an additional flag:
* IPOIB_FLAG_DEV_ADDR_CTRL

When setting the device address, there is no validation of the upper
twelve bytes of the device address (flags, qpn, subnet prefix) as those
bytes are not under the control of the user.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 3b561130 25-May-2016 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Support SendOnlyFullMember MCG for SendOnly join

Check (via an SA query) whether the SM supports the new option for SendOnly
multicast joins.
If the SM supports that option, the driver uses the new join state to create
such a multicast group.
If SendOnlyFullMember is supported, we no longer use a faked FullMember state
join for a SendOnly MCG; the correct state is used instead.

This check is performed at every invocation of mcast_restart task, to be
sure that the driver stays in sync with the current state of the SM.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 08bc3276 11-Feb-2016 Alex Estrin <alex.estrin@intel.com>

IB/ipoib: fix for rare multicast join race condition

A narrow window for a race condition still exists between the
multicast join thread and the *dev_flush workers.
A kernel crash caused by prolonged erratic link state changes
was observed (most likely due to faulty cabling):

[167275.656270] BUG: unable to handle kernel NULL pointer dereference at
0000000000000020
[167275.665973] IP: [<ffffffffa05f8f2e>] ipoib_mcast_join+0xae/0x1d0 [ib_ipoib]
[167275.674443] PGD 0
[167275.677373] Oops: 0000 [#1] SMP
...
[167275.977530] Call Trace:
[167275.982225] [<ffffffffa05f92f0>] ? ipoib_mcast_free+0x200/0x200 [ib_ipoib]
[167275.992024] [<ffffffffa05fa1b7>] ipoib_mcast_join_task+0x2a7/0x490
[ib_ipoib]
[167276.002149] [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[167276.010754] [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[167276.019088] [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[167276.027737] [<ffffffff810a5aef>] kthread+0xcf/0xe0
Here was a hit spot:
ipoib_mcast_join() {
..............
rec.qkey = priv->broadcast->mcmember.qkey;
^^^^^^^
.....
}
The proposed patch should prevent the multicast join task from continuing
if a link state change is detected.

Signed-off-by: Alex Estrin <alex.estrin@intel.com>

Changes from v4:
- as suggested by Doug Ledford, optimized spinlock usage,
i.e. ipoib_mcast_join() is called with lock held.
Changes from v3:
- sync with priv->lock before flag check.
Changes from v2:
- Move check for OPER_UP flag state to mcast_join() to
ensure no event worker is in progress.
- minor style fixes.
Changes from v1:
- No need to lock again if error detected.
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 50be28de 07-Jan-2016 Erez Shitrit <erezsh@mellanox.com>

IB/IPoIB: Fix kernel panic on multicast flow

ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the
parameter mcast->dev. That mcast is a temporary (used as an iterator)
variable that may be uninitialized.
There is no need to send the variable dev to the function, as each mcast
has its dev as a member in the mcast struct.

This causes the following panic:
RIP: 0010: ipoib_mcast_leave+0x6d/0xf0 [ib_ipoib]
RSP: 0018: EFLAGS: 00010246
RAX: f0201 RBX: 24e00 RCX: 00000
....
....
Stack:
Call Trace:
ipoib_mcast_remove_list+0x3a/0x70 [ib_ipoib]
ipoib_mcast_restart_task+0x3bb/0x520 [ib_ipoib]
process_one_work+0x164/0x470
worker_thread+0x11d/0x420
...

Fixes: 5a0e81f6f483 ('IB/IPoIB: factor out common multicast list removal code')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reported-by: Doron Tsur <doront@mellanox.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 432c55ff 21-Dec-2015 Christoph Lameter <cl@linux.com>

IB/IPoIB: Move multicast specific code out of ipoib_main.c

Code cleanup to move multicast specific code that checks for
a sendonly join to ipoib_multicast.c. This allows the removal
of the export of __ipoib_mcast_find().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 5a0e81f6 21-Dec-2015 Christoph Lameter <cl@linux.com>

IB/IPoIB: factor out common multicast list removal code

Code cleanup to remove multicast specific code from ipoib_main.c

The removal of a list of multicast groups occurs in three places.
Create a new function ipoib_mcast_remove_list(). Use this new
function in ipoib_main.c too.
That in turn allows the dropping of two functions that were
exported from ipoib_multicast.c for expiration of mc groups.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 55ee3ab2 15-Oct-2015 Matan Barak <matanb@mellanox.com>

IB/core: Add netdev and gid attributes parameters to cache

Adding an ability to query the IB cache by a netdev and get the
attributes of a GID. These parameters are necessary in order to
successfully resolve the required GID (when the netdevice is known)
and get the Ethernet L2 attributes from a GID.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0b5c9279 11-Oct-2015 Christoph Lameter <cl@linux.com>

IB/ipoib: For sendonly join free the multicast group on leave

When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak
that causes ib_dealloc_pd to fail and print a WARN_ON message
and backtrace.

Fixes: bd99b2e05c4d (IB/ipoib: Expire sendonly multicast joins)
Signed-off-by: Christoph Lameter <cl@linux.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# e622f2f4 08-Oct-2015 Christoph Hellwig <hch@lst.de>

IB: split struct ib_send_wr

This patch splits up struct ib_send_wr so that all non-trivial verbs
use their own structure which embeds struct ib_send_wr. This dramatically
shrinks the size of a WR for most common operations:

sizeof(struct ib_send_wr) (old): 96

sizeof(struct ib_send_wr): 48
sizeof(struct ib_rdma_wr): 64
sizeof(struct ib_atomic_wr): 96
sizeof(struct ib_ud_wr): 88
sizeof(struct ib_fast_reg_wr): 88
sizeof(struct ib_bind_mw_wr): 96
sizeof(struct ib_sig_handover_wr): 80

And with Sagi's pending MR rework the fast registration WR will also be
down to a reasonable size:

sizeof(struct ib_fastreg_wr): 64

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
Tested-by: Haggai Eran <haggaie@mellanox.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>


# c3852ab0 25-Sep-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: Make sendonly multicast joins create the mcast group

Since IPoIB should, as much as possible, emulate how multicast
sends work on Ethernet for regular TCP/IP apps, there should be
no requirement to subscribe to a multicast group before your
sends are properly sent. However, due to the difference in how
multicast is handled on InfiniBand, we must join the appropriate
multicast group before we can send to it. Previously we tried
not to trigger the auto-create feature of the subnet manager when
doing this because we didn't have tracking of these sendonly
groups and the auto-creation might never get undone. The previous
patch added timing to these sendonly joins and allows us to
leave them after a reasonable idle expiration time. So supply
all of the information needed to auto-create group.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# bd99b2e0 23-Sep-2015 Christoph Lameter <cl@linux.com>

IB/ipoib: Expire sendonly multicast joins

On neighbor expiration, check to see if the neighbor was actually a
sendonly multicast join, and if so, leave the multicast group as we
expire the neighbor.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# d1178cbc 21-Aug-2015 Jason Gunthorpe <jgg@ziepe.ca>

IB/ipoib: Suppress warning for send only join failures

We expect send only joins to fail; it just means there are no listeners
for the group. The correct thing to do is silently drop the packet
at source.

E.g. when no IP router is listening to IGMP, avahi will full join
224.0.0.251, which causes a send only IGMP packet to 224.0.0.22, and
then a warning level kmessage like this:

ib0: sendonly multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# c3acdc06 03-Sep-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: Clean up send-only multicast joins

Even though we don't expect the group to be created by the SM we
still need to provide all the parameters to force the SM to validate
they are correct.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0e5544d9 02-Apr-2015 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Remove IPOIB_MCAST_RUN bit

After Doug Ledford's changes there is no need for that bit; its
semantics become a subset of the IPOIB_FLAG_OPER_UP bit.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 3fd0605c 02-Apr-2015 Erez Shitrit <erezsh@mellanox.com>

IB/ipoib: Update broadcast record values after each successful join request

Update the cached broadcast record in the priv object after every new
join of this broadcast domain group.

These values are needed for the port configuration (MTU size) and for
the initial parameters of all new multicast (non-broadcast) join requests.

For example, the SM starts with a 2K MTU for the whole fabric, and later
restarts (or hands over to a new SM) with a new port configuration of 4K MTU.
Without using the new values, the driver will keep its old configuration
of 2K and will not apply the new configuration of 4K.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 1c0453d6 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: drop mcast_mutex usage

We needed the mcast_mutex when we had to prevent the join completion
callback from having the value it stored in mcast->mc overwritten
by a delayed return from ib_sa_join_multicast. By storing the return
of ib_sa_join_multicast in an intermediate variable, we prevent a
delayed return from ib_sa_join_multicast overwriting the valid
contents of mcast->mc, and we no longer need a mutex to force the
join callback to run after the return of ib_sa_join_multicast. This
allows us to do away with the mutex entirely and protect our critical
sections with just a spinlock instead. This is highly desirable
as there were some places where we couldn't use a mutex because the
code was not allowed to sleep, and so we were currently using a mix
of mutex and spinlock to protect what we needed to protect. Now we
only have a spin lock and the locking complexity is greatly reduced.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# d2fe937c 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: deserialize multicast joins

Allow the ipoib layer to attempt to join all outstanding multicast
groups at once. The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel. Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join. There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling.
If we completed successfully, we are done. The join task now needs to be
kicked either by mcast_send or mcast_restart_task or mcast_start_thread, but
should not need to be started at any other time except when scheduling a
backoff attempt to rejoin.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying. We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.
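
A rough user-space sketch of special cases 1 and 4 follows. It is not the
driver code: struct group, join_pass and SENDONLY_MAX_ATTEMPTS are
hypothetical names, and "issuing a request" is reduced to setting a flag.
It only shows the shape of a single pass that fires off every outstanding
join instead of one at a time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap, taken from "Only try to join a SENDONLY join twice". */
#define SENDONLY_MAX_ATTEMPTS 2

/* Simplified stand-in for a multicast group entry; not the driver's struct. */
struct group {
	bool sendonly;      /* send-only join rather than a full join */
	bool joined;        /* join already succeeded */
	bool busy;          /* a join request is in flight */
	int attempts;       /* join requests issued so far */
	int queued_packets; /* packets waiting for the join */
};

/*
 * One pass of a hypothetical join task. Issues a request for every group
 * that still needs one and returns how many requests were issued.
 */
size_t join_pass(bool broadcast_joined, struct group *groups, size_t n)
{
	size_t issued = 0;

	/* Case 1: join nothing else until the broadcast join has succeeded. */
	if (!broadcast_joined)
		return 0;

	for (size_t i = 0; i < n; i++) {
		struct group *g = &groups[i];

		if (g->joined || g->busy)
			continue;

		/* Case 4: a send-only group gets two tries, then packets are dropped. */
		if (g->sendonly && g->attempts >= SENDONLY_MAX_ATTEMPTS) {
			g->queued_packets = 0;
			continue;
		}

		g->attempts++;
		g->busy = true; /* request sent; a completion would clear this */
		issued++;
	}
	return issued;
}
```

In this sketch a new packet for an exhausted send-only group would reset
attempts and queue the packet before the next pass; that part is left out.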

Signed-off-by: Doug Ledford <dledford@redhat.com>


# 69911416 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: fix MCAST_FLAG_BUSY usage

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used. We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it. And when we did
complete it, sometimes we continued to touch the mcast entry after
the completion, opening us up to possible use after free issues.

This made it less than totally effective, and certainly made its use
confusing. And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, but we never
cleared this flag, ever.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

1) Remove the MCAST_JOIN_STARTED flag entirely
2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback. This forces
ib_sa_multicast_join to return and set mcast->mc before we process
the callback. This way, our callback can safely clear mcast->mc
if there is an error on the join and we will do the right thing as
a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no
longer call mcast_sendonly_join directly from mcast_send and
instead must add sendonly join processing to the mcast_join_task
8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
we have a running task. We know when we need to reschedule our
join task thread and don't need a flag to tell us.
9) Add a helper for rescheduling the join task thread
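
Points 3 and 5 can be illustrated with a small user-space sketch. The
helpers err_ptr and is_err_or_null are stand-ins modelled on the kernel's
error-pointer convention, and struct mcast, finish_join and needs_leave are
hypothetical names. Plain booleans replace the flag bit and the completion,
so this shows the ordering and the test, not real synchronization.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value encoded in an error pointer (kernel convention). */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *err_ptr(long error)
{
	return (void *)error;
}

/* True for NULL and for pointers in the top MAX_ERRNO addresses. */
bool is_err_or_null(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Simplified stand-in for a multicast entry. */
struct mcast {
	void *mc;   /* join handle, NULL, or an encoded error */
	bool busy;  /* stand-in for MCAST_FLAG_BUSY */
	bool done;  /* stand-in for the completion */
};

/* Point 5: clear the busy state first, then signal completion. */
void finish_join(struct mcast *m)
{
	m->busy = false;
	m->done = true;
}

/* Point 3: a leave is only needed when mc holds a real handle. */
bool needs_leave(const struct mcast *m)
{
	return !is_err_or_null(m->mc);
}
```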

A number of different races are resolved with these changes. These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                                Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                        ifconfig ib0 down
                                        wait_for_completion
    callback call completes
                                        wait_for_completion in
                                        mcast_dev_flush completes
                                          mcast->mc is PTR_ERR_OR_NULL
                                          so we skip ib_sa_leave_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                    Thread 2                    Thread 3
ib_sa_multicast_join
                                                        ifconfig ib0 down
                                                        priv->broadcast = NULL
                            join_complete
                                                        wait_for_completion
                            mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
                            complete
                            return -EAGAIN (making mcast->mc invalid)
                                                        call ib_sa_multicast_leave
                                                        on invalid mcast->mc, hang
                                                        forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback. This allows us to clear mcast->mc if there is an
error and the join is going to fail. We do this before we complete
the mcast. In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# efc82eee 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: No longer use flush as a parameter

Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue. Now that we have per device workqueues and a
specific flush workqueue, there is no longer a deadlock issue with
flushing the device specific workqueues and we can do so unilaterally.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# 0b39578b 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: Use dedicated workqueues per interface

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up. It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue means
that we would also have to prevent all races between different devices
(for instance, ipoib_mcast_restart_task on interface ib0 can race with
ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
the rtnl_lock).

There are several possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail. This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail. Again, this can
drop work on the floor.

Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface. In addition, keep the global
workqueue, but now limit it to only flush tasks. In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# 894021a7 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: Make the carrier_on_task race aware

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface. Unfortunately, that's not
the case. In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too. But this thread is
deadlocked on the rtnl lock. The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_OPER_UP has been cleared, in which case we don't need to do
anything anyway and we just return.
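
The trylock loop described above can be sketched in user-space C11 as
follows. This is only an illustration: struct iface, lock_unless_down and
iface_unlock are hypothetical names, an atomic_flag stands in for the rtnl
lock, and an atomic_bool stands in for FLAG_OPER_UP.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for per-interface state. */
struct iface {
	atomic_flag big_lock; /* stand-in for the rtnl lock */
	atomic_bool oper_up;  /* stand-in for FLAG_OPER_UP */
};

/*
 * Try to take the lock, but give up as soon as the interface is seen to
 * be down. Returns true if the lock was acquired (the caller must release
 * it), false if the interface went down and there is nothing left to do.
 */
bool lock_unless_down(struct iface *dev)
{
	/* test_and_set returns the previous value: true means already held. */
	while (atomic_flag_test_and_set(&dev->big_lock)) {
		if (!atomic_load(&dev->oper_up))
			return false;
		/* A real task would sleep briefly here instead of spinning. */
	}
	return true;
}

/* Release the lock taken by lock_unless_down(). */
void iface_unlock(struct iface *dev)
{
	atomic_flag_clear(&dev->big_lock);
}
```

Because the waiter never blocks on the lock itself, whoever holds the lock
while taking the interface down can wait for this task without deadlocking.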

While investigating which flag should be used, FLAG_ADMIN_UP or
FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
appropriate flag to use. However, there was a mix of these two flags in
use in the existing code. So while we check for that flag here as part
of this race fix, also cleanup the two places that had used the less
appropriate flag for their tests.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# c84ca6d2 21-Feb-2015 Doug Ledford <dledford@redhat.com>

IB/ipoib: Consolidate rtnl_lock tasks in workqueue

The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
needs to keep it held. It also needs to call flush_workqueue() to flush
out any outstanding work. In the past, we've had to try and make sure
that we didn't flush out any outstanding join completions because they
also wanted to grab rtnl_lock() and that would deadlock. It turns out
that the only thing in the join completion handler that needs this lock
can be safely moved to our carrier_on_task, thereby reducing the
potential for the join completion code and the flush code to deadlock
against each other.

Signed-off-by: Doug Ledford <dledford@redhat.com>


# 4143a951 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: Consolidate rtnl_lock tasks in workqueue"

This reverts commit afe1de664ef3cb756e70938d99417dcbc6b1379a.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# c6a7ec7a 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: Make the carrier_on_task race aware"

This reverts commit 67d7209e1f481cbaed37f9a224a328a3f83d0482.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# e7a623d2 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: fix MCAST_FLAG_BUSY usage"

This reverts commit 016d9fb25cd9817ea9c723f4f7ecd978636b4489.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# 962121b4 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: fix mcast_dev_flush/mcast_restart_task race"

This reverts commit e5d1dcf1b0951f4ba00d93653942dda6196109d8.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# 0306eda2 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: Use dedicated workqueues per interface"

This reverts commit 5141861cd5e17eac9676ff49c5abfafbea2b0e98.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# 4e0ab200 30-Jan-2015 Roland Dreier <roland@purestorage.com>

Revert "IPoIB: Make ipoib_mcast_stop_thread flush the workqueue"

This reverts commit bb42a6dd02fb2901a69dbec2358810735b14b186.

The series of IPoIB bug fixes that went into 3.19-rc1 introduce
regressions, and after trying to sort things out, we decided to revert
to 3.18's IPoIB driver and get things right for 3.20.

Signed-off-by: Roland Dreier <roland@purestorage.com>


# bb42a6dd 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: Make ipoib_mcast_stop_thread flush the workqueue

We used to pass a flush variable to mcast_stop_thread to indicate if
we should flush the workqueue or not. This was due to some code
trying to flush a workqueue that it was currently running on which is
a no-no. Now that we have per-device work queues, and now that
ipoib_mcast_restart_task takes into account the fact that it is queued on a
single thread workqueue with all of the ipoib_mcast_join_task's and
therefore has no need to stop the join task while it runs, we can do
away with the flush parameter and unilaterally flush always.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 5141861c 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: Use dedicated workqueues per interface

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up. It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue, we can
have one device in the middle of a down and holding the rtnl lock and
another totally unrelated device needing to run mcast_restart_task,
which wants the rtnl lock and will loop trying to take it unless it
sees its own FLAG_ADMIN_UP flag go away. Because the unrelated
interface will never see its own ADMIN_UP flag drop, the interface
going down will deadlock trying to flush the queue. There are several
possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail. This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail. Again, this can
drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go
away, then maybe after a few tries on the rtnl lock we can queue our
own task back up as a delayed work and return and avoid dropping work
on the floor that way. But I'm not 100% convinced that we won't cause
other problems.

Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface. In addition, keep the global
workqueue, but now limit it to only flush tasks. In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# e5d1dcf1 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: fix mcast_dev_flush/mcast_restart_task race

Our mcast_dev_flush routine and our mcast_restart_task can race
against each other. In particular, they both hold the priv->lock
while manipulating the rbtree and while removing mcast entries from
the multicast_list and while adding entries to the remove_list, but
they also both drop their locks prior to doing the actual removes.
The mcast_dev_flush routine is run entirely under the rtnl lock and so
has at least some locking. The actual race condition is like this:

Thread 1                                Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
    call mcast_restart_task to add new entries
                                        ifconfig ib0 down
                                          mcast_dev_flush
                                            mcast_leave(mcast A)
    mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue. To avoid stomping on each other's removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 016d9fb2 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: fix MCAST_FLAG_BUSY usage

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used. We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it. This made it less
than totally effective, and certainly made its use confusing. And in
the flush function we would use the presence of this flag to signal
that we should wait on the completion struct, but we never cleared
this flag, ever. This is further muddied by the fact that we overload
the MCAST_FLAG_BUSY flag to mean two different things: we have a join
in flight, and we have succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback. This forces
ib_sa_multicast_join to return and set mcast->mc before we process
the callback. This way, our callback can safely clear mcast->mc
if there is an error on the join and we will do the right thing as
a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no
longer call mcast_sendonly_join directly from mcast_send and
instead must add sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes. These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                                Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                        ifconfig ib0 down
                                        wait_for_completion
    callback call completes
                                        wait_for_completion in
                                        mcast_dev_flush completes
                                          mcast->mc is PTR_ERR_OR_NULL
                                          so we skip ib_sa_leave_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                    Thread 2                    Thread 3
ib_sa_multicast_join
                                                        ifconfig ib0 down
                                                        priv->broadcast = NULL
                            join_complete
                                                        wait_for_completion
                            mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
                            complete
                            return -EAGAIN (making mcast->mc invalid)
                                                        call ib_sa_multicast_leave
                                                        on invalid mcast->mc, hang
                                                        forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback. This allows us to clear mcast->mc if there is an
error and the join is going to fail. We do this before we complete
the mcast. In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 67d7209e 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: Make the carrier_on_task race aware

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface. Unfortunately, that's not
the case. In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too. But this thread is
deadlocked on the rtnl lock. The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_ADMIN_UP has been cleared, in which case we don't need to do
anything anyway and we just return.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# afe1de66 10-Dec-2014 Doug Ledford <dledford@redhat.com>

IPoIB: Consolidate rtnl_lock tasks in workqueue

Setting the MTU can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 68f9d83c 20-Aug-2014 Alex Estrin <alex.estrin@intel.com>

IPoIB: Remove unnecessary port query

There are two queries for port attributes one after another. A second
call is not needed since port_attr structure already holds the data.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 94232d9c 16-Oct-2013 Erez Shitrit <erezsh@mellanox.com>

IPoIB: Start multicast join process only on active ports

The driver starts the mcast_join task whenever the netdev interface is
UP without relation to the underlying IB port state.

Until the port state is ACTIVE all the join requests are irrelevant,
and the IB core returns -EINVAL. So the user will see errors such as:
"multicast join failed for ff12:401b:... , status -22".

Instead, have ipoib_mcast_join_task() return when the port is not active.

It will be called again when the port state is changed and the
low-level driver triggers the IB_EVENT_PORT_ACTIVE event or the
IB_EVENT_CLIENT_REREGISTER event.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# a9c8ba58 16-Oct-2013 Erez Shitrit <erezsh@mellanox.com>

IPoIB: Fix usage of uninitialized multicast objects

The driver should avoid calling ib_sa_free_multicast on the mcast->mc
object until it finishes its initialization state. Otherwise we can
crash when ipoib_mcast_dev_flush() attempts to use the uninitialized
multicast object.

Instead, only call wait_for_completion() for multicast entries that
started the join process, meaning that ib_sa_join_multicast() finished.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# bea1e22d 30-Aug-2012 Patrick McHardy <kaber@trash.net>

IPoIB: Fix use-after-free of multicast object

Fix a crash in ipoib_mcast_join_task(). (with help from Or Gerlitz)

Commit c8c2afe360b7 ("IPoIB: Use rtnl lock/unlock when changing device
flags") added a call to rtnl_lock() in ipoib_mcast_join_task(), which
is run from the ipoib_workqueue, and hence the workqueue can't be
flushed from the context of ipoib_stop().

In the current code, ipoib_stop() (which doesn't flush the workqueue)
calls ipoib_mcast_dev_flush(), which goes and deletes all the
multicast entries. This takes place without any synchronization with
a possible running instance of ipoib_mcast_join_task() for the same
ipoib device, leading to a crash due to NULL pointer dereference.

Fix this by making sure that the workqueue is flushed before
ipoib_mcast_dev_flush() is called. To make that possible, we move the
RTNL-lock wrapped code to ipoib_mcast_join_finish().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# b5120a6e 29-Aug-2012 Shlomo Pongratz <shlomop@mellanox.com>

IPoIB: Fix AB-BA deadlock when deleting neighbours

Lockdep points out a circular locking dependency between the ipoib
device priv spinlock (priv->lock) and the neighbour table rwlock
(ntbl->rwlock).

In the normal path, i.e. the neighbour garbage collection task, the neigh
table rwlock is taken first and then if the neighbour needs to be
deleted, priv->lock is taken.

However in some error paths, such as in ipoib_cm_handle_tx_wc(),
priv->lock is taken first and then ipoib_neigh_free routine is called
which in turn takes the neighbour table ntbl->rwlock.

The solution is to get rid of the neigh table rwlock completely and use
only priv->lock.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# b63b70d8 24-Jul-2012 Shlomo Pongratz <shlomop@mellanox.com>

IPoIB: Use a private hash table for path lookup in xmit path

Dave Miller <davem@davemloft.net> provided a detailed description of
why the way IPoIB is using neighbours for its own ipoib_neigh struct
is buggy:

Any time an ipoib_neigh is changed, a sequence like the following is made:

spin_lock_irqsave(&priv->lock, flags);
/*
* It's safe to call ipoib_put_ah() inside
* priv->lock here, because we know that
* path->ah will always hold one more reference,
* so ipoib_put_ah() will never do more than
* decrement the ref count.
*/
if (neigh->ah)
ipoib_put_ah(neigh->ah);
list_del(&neigh->list);
ipoib_neigh_free(dev, neigh);
spin_unlock_irqrestore(&priv->lock, flags);
ipoib_path_lookup(skb, n, dev);

This doesn't work, because you're leaving a stale pointer to the freed up
ipoib_neigh in the special neigh->ha pointer cookie. Yes, it even fails
with all the locking done to protect _changes_ to *ipoib_neigh(n), and
with the code in ipoib_neigh_free() that NULLs out the pointer.

The core issue is that read side calls to *to_ipoib_neigh(n) are not
being synchronized at all, they are performed without any locking. So
whether we hold the lock or not when making changes to *ipoib_neigh(n)
you still can have threads see references to freed up ipoib_neigh
objects.

cpu 1 cpu 2
n = *ipoib_neigh()
*ipoib_neigh() = NULL
kfree(n)
n->foo == OOPS

[..]

Perhaps the ipoib code can have a private path database it manages
entirely itself, which holds all the necessary information and is
looked up by some generic key which is available easily at transmit
time and does not involve generic neighbour entries.

See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
<http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
for the full discussion.

This patch aims to solve the race conditions found in the IPoIB driver.

The patch removes the connection between the core networking neighbour
structure and the ipoib_neigh structure. In addition to avoiding the
race described above, it allows us to handle SKBs carrying IP packets
that don't have any associated neighbour.

We add an ipoib_neigh hash table with N buckets where the key is the
destination hardware address. The ipoib_neigh is fetched from the
hash table instead of the stashed location in the neighbour
structure. The hash table uses both RCU and reference counting to
guarantee that no ipoib_neigh instance is ever deleted while in use.

Fetching the ipoib_neigh structure instance from the hash also makes
the special code in ipoib_start_xmit that handles remote and local
bonding failover redundant.

Aged ipoib_neigh instances are deleted by a garbage collection task
that runs every M seconds and deletes every ipoib_neigh instance that
was idle for at least 2*M seconds. The deletion is safe since the
ipoib_neigh instances are protected using RCU and reference count
mechanisms.
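
A minimal user-space sketch of the scheme described above, assuming a
20-byte hardware address as the key. It models only the reference
counting and the idle-time garbage collection; RCU is deliberately left
out, and every name here (struct neigh, neigh_get(), neigh_gc(), ...) is
hypothetical rather than taken from the driver.

```c
#include <stdlib.h>
#include <string.h>

#define HWADDR_LEN 20	/* assumed IPoIB link-layer address length */
#define NBUCKETS   16	/* stand-in for N */

struct neigh {
	unsigned char daddr[HWADDR_LEN];
	unsigned long last_used;	/* stand-in for a jiffies timestamp */
	int refcnt;			/* one for the table, one per user */
	struct neigh *next;
};

static struct neigh *table[NBUCKETS];

static unsigned int hash_addr(const unsigned char *daddr)
{
	unsigned int h = 0;

	for (int i = 0; i < HWADDR_LEN; i++)
		h = h * 31 + daddr[i];
	return h % NBUCKETS;
}

/* Insert a new entry; the caller gets its own reference. */
static struct neigh *neigh_alloc(const unsigned char *daddr, unsigned long now)
{
	struct neigh *n = calloc(1, sizeof(*n));
	unsigned int b = hash_addr(daddr);

	if (!n)
		return NULL;
	memcpy(n->daddr, daddr, HWADDR_LEN);
	n->last_used = now;
	n->refcnt = 2;			/* table + caller */
	n->next = table[b];
	table[b] = n;
	return n;
}

/* Look up by destination address, taking a reference on success. */
static struct neigh *neigh_get(const unsigned char *daddr, unsigned long now)
{
	for (struct neigh *n = table[hash_addr(daddr)]; n; n = n->next) {
		if (!memcmp(n->daddr, daddr, HWADDR_LEN)) {
			n->refcnt++;
			n->last_used = now;
			return n;
		}
	}
	return NULL;
}

/* Drop a reference; the entry is freed only when nobody uses it. */
static void neigh_put(struct neigh *n)
{
	if (--n->refcnt == 0)
		free(n);
}

/* Unlink every entry idle for at least 2*m; returns the number unlinked. */
static int neigh_gc(unsigned long now, unsigned long m)
{
	int removed = 0;

	for (int b = 0; b < NBUCKETS; b++) {
		struct neigh **pp = &table[b];

		while (*pp) {
			struct neigh *n = *pp;

			if (now - n->last_used >= 2 * m) {
				*pp = n->next;
				neigh_put(n);	/* drop the table's reference */
				removed++;
			} else {
				pp = &n->next;
			}
		}
	}
	return removed;
}
```

An entry unlinked by neigh_gc() stays valid for any caller that still
holds a reference, which is the property the commit relies on.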

The number of buckets (N) and frequency of running the GC thread (M),
are taken from the exported arp_tbl.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 700db99d 05-Jul-2012 David S. Miller <davem@davemloft.net>

ipoib: Need to do dst_neigh_lookup_skb() outside of priv->lock.

Otherwise local_bh_enable() complains.

Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 178709bb 02-Jul-2012 David S. Miller <davem@davemloft.net>

ipoib: Convert over to dev_lookup_neigh_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>


# 936d7de3 07-Feb-2012 Roland Dreier <roland@purestorage.com>

IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses

Commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound
explicit.") made it possible for a netdev driver to use skb->cb
between its header_ops.create method and its .ndo_start_xmit
method. Use this in ipoib_hard_header() to stash away the LL address
(GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
IPoIB to stop lying about its hard_header_len, which will let us fix
the L2 check for GRO.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 377cb4f9 07-Feb-2012 Roland Dreier <roland@purestorage.com>

IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses

Commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound
explicit.") made it possible for a netdev driver to use skb->cb
between its header_ops.create method and its .ndo_start_xmit
method. Use this in ipoib_hard_header() to stash away the LL address
(GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
IPoIB to stop lying about its hard_header_len, which will let us fix
the L2 check for GRO.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27217455 02-Dec-2011 David Miller <davem@davemloft.net>

net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.

To reflect the fact that a reference is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>


# 580da35a 29-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com>

IB: Fix RCU lockdep splats

Commit f2c31e32b37 ("net: fix NULL dereferences in check_peer_redir()")
forgot to take care of infiniband uses of dst neighbours.

Many thanks to Marc Aurele who provided a nice bug report and feedback.

Reported-by: Marc Aurele La France <tsi@ualberta.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 3874397c 21-Nov-2011 Mike Marciniszyn <mike.marciniszyn@qlogic.com>

IB/ipoib: Prevent hung task or softlockup processing multicast response

The following can occur with ipoib when processing a multicast response:

BUG: soft lockup - CPU#0 stuck for 67s! [ib_mad1:982]
Modules linked in: ...
CPU 0:
Modules linked in: ...
Pid: 982, comm: ib_mad1 Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant DL160 G5
RIP: 0010:[<ffffffff814ddb27>] [<ffffffff814ddb27>] _spin_unlock_irqrestore+0x17/0x20
RSP: 0018:ffff8802119ed860 EFLAGS: 00000246
RAX: 0000000000000004 RBX: ffff8802119ed860 RCX: 000000000000a299
RDX: ffff88021086c700 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffffffff8100bc8e R08: ffff880210ac229c R09: 0000000000000000
R10: ffff88021278aab8 R11: 0000000000000000 R12: ffff8802119ed860
R13: ffffffff8100be6e R14: 0000000000000001 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d4840 CR3: 0000000209aa5000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffffa032c247>] ? ipoib_mcast_send+0x157/0x480 [ib_ipoib]
[<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffffa03283d4>] ? ipoib_path_lookup+0x124/0x2d0 [ib_ipoib]
[<ffffffffa03286fc>] ? ipoib_start_xmit+0x17c/0x430 [ib_ipoib]
[<ffffffff8141e758>] ? dev_hard_start_xmit+0x2c8/0x3f0
[<ffffffff81439d0a>] ? sch_direct_xmit+0x15a/0x1c0
[<ffffffff81423098>] ? dev_queue_xmit+0x388/0x4d0
[<ffffffffa032d6b7>] ? ipoib_mcast_join_finish+0x2c7/0x510 [ib_ipoib]
[<ffffffffa032dab8>] ? ipoib_mcast_sendonly_join_complete+0x1b8/0x1f0 [ib_ipoib]
[<ffffffffa02a0946>] ? mcast_work_handler+0x1a6/0x710 [ib_sa]
[<ffffffffa015f01e>] ? ib_send_mad+0xfe/0x3c0 [ib_mad]
[<ffffffffa00f6c93>] ? ib_get_cached_lmc+0xa3/0xb0 [ib_core]
[<ffffffffa02a0f9b>] ? join_handler+0xeb/0x200 [ib_sa]
[<ffffffffa029e4fc>] ? ib_sa_mcmember_rec_callback+0x5c/0xa0 [ib_sa]
[<ffffffffa029e79c>] ? recv_handler+0x3c/0x70 [ib_sa]
[<ffffffffa01603a4>] ? ib_mad_completion_handler+0x844/0x9d0 [ib_mad]
[<ffffffffa015fb60>] ? ib_mad_completion_handler+0x0/0x9d0 [ib_mad]
[<ffffffff81088830>] ? worker_thread+0x170/0x2a0
[<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810886c0>] ? worker_thread+0x0/0x2a0
[<ffffffff8108ddf6>] ? kthread+0x96/0xa0
[<ffffffff8100c1ca>] ? child_rip+0xa/0x20

Coinciding with the stack trace is the following message:

ib0: ib_address_create failed

The code below in ipoib_mcast_join_finish() will note the above
failure in the address handle but otherwise continue:

ah = ipoib_create_ah(dev, priv->pd, &av);
if (!ah) {
ipoib_warn(priv, "ib_address_create failed\n");
} else {

The while loop at the bottom of ipoib_mcast_join_finish() will attempt
to send queued multicast packets in mcast->pkt_queue and eventually
end up in ipoib_mcast_send():

if (!mcast->ah) {
if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
skb_queue_tail(&mcast->pkt_queue, skb);
else {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
}

My read is that the code will requeue the packet and return to the
ipoib_mcast_join_finish() while loop and the stage is set for the
"hung" task diagnostic as the while loop never sees a non-NULL ah, and
will do nothing to resolve.

There are GFP_ATOMIC allocates in the provider routines, so this is
possible and should be dealt with.

The test that induced the failure is associated with a host SM on the
same server during a shutdown.

This patch causes ipoib_mcast_join_finish() to exit with an error
which will flush the queued mcast packets. Nothing is done to unwind
the QP attached state so that subsequent sends from above will retry
the join.
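
The control flow of the fix can be sketched in user space as follows.
This is a simplified model, not the driver code: struct mcast_model,
join_finish() and flush_queue() are hypothetical names, the packet queue
is reduced to a counter, and the -1 return value stands in for whatever
error code the real patch uses.

```c
#include <stddef.h>

struct mcast_model {
	void *ah;	/* address handle, NULL if creation failed */
	int queued;	/* packets waiting in the pending queue */
	int dropped;	/* packets discarded by a flush */
};

/*
 * Model of the fixed join-finish path: without an address handle,
 * return an error instead of entering the drain loop, because every
 * send attempt would only requeue the packet.
 */
static int join_finish(struct mcast_model *m, void *new_ah)
{
	m->ah = new_ah;
	if (!m->ah)
		return -1;	/* assumed error value for this sketch */

	while (m->queued > 0)
		m->queued--;	/* stand-in for transmitting one packet */
	return 0;
}

/* On error the caller discards the pending packets. */
static void flush_queue(struct mcast_model *m)
{
	m->dropped += m->queued;
	m->queued = 0;
}
```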

Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Reviewed-by: Gary Leshner <gary.leshner@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# fec14d2f 29-Aug-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

infiniband: add moduleparam.h to drivers/infiniband as required

These files were getting the moduleparam infrastructure from the
implicit presence of module.h being everywhere, but that is going
away soon.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 69cce1d1 18-Jul-2011 David S. Miller <davem@davemloft.net>

net: Abstract dst->neighbour accesses behind helpers.

dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>


# 22bedad3 01-Apr-2010 Jiri Pirko <jpirko@redhat.com>

net: convert multicast list to list_head

Converts the list and the core code that manipulates it to be the same as uc_list.

+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulation with lists on a sandbox (used in bonding and 80211 drivers)

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 3e4aa12f 21-Mar-2010 Jiri Pirko <jpirko@redhat.com>

ipoib: remove addrlen check for mc addresses

Finally this bit can be removed. Now that the bonding driver has been
changed/fixed (32a806c194ea112cfab00f558482dd97bee5e44e net-next-2.6),
it is no longer possible for an addr with a length different from
dev->addr_len to be present in the list. Remove this check, since in the
new mc_list there will be no addrlen in the record.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6c74651c 26-Feb-2010 Jiri Pirko <jpirko@redhat.com>

ipoib: returned back addrlen check for mc addresses

Apparently a bogus mc address can break IPoIB multicast processing. Therefore
restore the addrlen check until this is resolved in bonding (I don't
see any other place where an mc address with a length other than
dev->addr_len could come from).

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fbf219f1 23-Feb-2010 Jiri Pirko <jpirko@redhat.com>

infiniband: convert to use netdev_for_each_mc_addr

Due to the loop complexity in nes_nic.c, I'm using char* to copy mc addresses
to it.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5ee95120 24-Sep-2009 Moni Shoua <monis@Voltaire.COM>

IPoIB: Don't turn on carrier for a non-active port

Multicast joins can succeed even if the IB port is down. This happens
when the SM runs on the same port with the requesting port. However,
IPoIB calls netif_carrier_on() when the join of the broadcast group
succeeds, without caring about the state of the IB port. The result
is an IPoIB interface in RUNNING state but without an active IB port
to support it.

If a bonding interface uses this IPoIB interface as a slave it might
not detect that this slave is almost useless and failover
functionality will be damaged. The fix checks the state of the IB
port in the carrier_task before calling netif_carrier_on().

Addresses: https://bugs.openfabrics.org/show_bug.cgi?id=1726
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 5e47596b 05-Sep-2009 Jason Gunthorpe <jgg@ziepe.ca>

IPoIB: Check multicast address format

Check that the format of multicast link addresses is correct before
taking them from dev->mc_list to priv->multicast_list. This way we
never try to send a bogus address to the SA, which prevents badness
from erroneous 'ip maddr addr add', broken bonding drivers, etc.
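
One plausible shape for such a check, sketched in user space: compare
the fixed fields of a candidate address with the interface broadcast
address. The byte offsets are assumptions about the 20-byte IPoIB
link-layer layout (QPN field, multicast prefix and scope, signature,
P_Key), and mcast_addr_is_valid() is a hypothetical name.

```c
#include <string.h>

/*
 * Returns 1 if addr looks like a multicast address belonging to the
 * same IPoIB subnet as broadcast, 0 otherwise.
 */
static int mcast_addr_is_valid(const unsigned char *addr,
			       const unsigned char *broadcast)
{
	/* bytes 0-5: reserved byte, QPN, multicast prefix, flags/scope */
	if (memcmp(addr, broadcast, 6))
		return 0;
	/*
	 * bytes 7-9: lower signature byte and P_Key. Byte 6 is skipped
	 * because the upper signature byte differs between IPv4 and IPv6.
	 */
	if (memcmp(addr + 7, broadcast + 7, 3))
		return 0;
	return 1;
}
```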

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 721d67cd 05-Sep-2009 Roland Dreier <rolandd@cisco.com>

IPoIB: Drop priv->lock before calling ipoib_send()

IPoIB currently must use irqsave locking for priv->lock, since it is
taken from interrupt context in one path. However, ipoib_send() does
skb_orphan(), and the network stack locking is not IRQ-safe.
Therefore we need to make sure we don't hold priv->lock when calling
ipoib_send() to avoid lockdep warnings (the code was almost certainly
safe in practice, since the only code path that takes priv->lock from
interrupt context would never call into the network stack).

Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757
Reported-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# adf30907 01-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com>

net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c209620 16-Jan-2009 Yossi Etigin <yosefe@Voltaire.COM>

IPoIB: Do not print error messages for multicast join retries

When IPoIB tries to join a multicast group, and the SA module's SM
address handle is NULL (because of an SM change, etc), the join
returns with -EAGAIN status. In that case, don't print an error
message unless multicast debugging is enabled.

Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 50df48f5 12-Jan-2009 Yossi Etigin <yosefe@Voltaire.COM>

IPoIB: Do not join broadcast group if interface is brought down

Because the ipoib_workqueue is not flushed when ipoib interface is
brought down, ipoib_mcast_join() may trigger a join to the broadcast
group after priv->broadcast was set to NULL (during cleanup). This
will cause the system to be a member of the broadcast group when
interface is down. As a side effect, this breaks the optimization of
setting the Q_key only when joining the broadcast group.

Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 5b095d989 29-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

net: replace %p6 with %pI6

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fcace2fe 28-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com>

infiniband: ipoib replace IPOIB_GID_FMT with %p6

Replace all uses of IPOIB_GID_FMT, IPOIB_GID_RAW_ARG() and IPOIB_GID_ARG()

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 943c246e 30-Sep-2008 Roland Dreier <rolandd@cisco.com>

IPoIB: Use netif_tx_lock() and get rid of private tx_lock, LLTX

Currently, IPoIB is an LLTX driver that uses its own IRQ-disabling
tx_lock. Not only do we want to get rid of LLTX, this actually causes
problems because of the skb_orphan() done with this tx_lock held: some
skb destructors expect to be run with interrupts enabled.

The simplest fix for this is to get rid of the driver-private tx_lock
and stop using LLTX. We kill off priv->tx_lock and use
netif_tx_lock[_bh]() instead; the patch to do this is a tiny bit
tricky because we need to update places that take priv->lock inside
the tx_lock to disable IRQs, rather than relying on tx_lock having
already disabled IRQs.

Also, there are a couple of places where we need to disable BHs to
make sure we have a consistent context to call netif_tx_lock() (since
we no longer can use _irqsave() variants), and we also have to change
ipoib_send_comp_handler() to call drain_tx_cq() through a timer rather
than directly, because ipoib_send_comp_handler() runs in interrupt
context and drain_tx_cq() must run in BH context so it can call
netif_tx_lock().

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# e8224e4b 16-Sep-2008 Yossi Etigin <yossi.openib@gmail.com>

IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop()

Taking rtnl_lock in ipoib_mcast_join_complete() causes a deadlock with
ipoib_stop(). We avoid it by scheduling the piece of code that takes
the lock on ipoib_workqueue instead of executing it directly. This
works because we only flush the ipoib_workqueue with the RTNL not held.

The deadlock happens because ipoib_stop() calls ipoib_ib_dev_down()
which calls ipoib_mcast_dev_flush(), which calls ipoib_mcast_free(),
which calls ipoib_mcast_leave(). The latter calls
ib_sa_free_multicast(), and this waits until the multicast completion
handler finishes. This handler is ipoib_mcast_join_complete(), which
waits for the rtnl_lock(), which was already taken by ipoib_stop().

This bug was introduced in commit a77a57a1 ("IPoIB: Fix deadlock on
RTNL in ipoib_stop()").

Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# a77a57a1 19-Aug-2008 Roland Dreier <rolandd@cisco.com>

IPoIB: Fix deadlock on RTNL in ipoib_stop()

Commit c8c2afe3 ("IPoIB: Use rtnl lock/unlock when changing device
flags") added a call to rtnl_lock() in ipoib_mcast_join_task(), which
is run from the ipoib_workqueue. However, ipoib_stop() (which is run
inside rtnl_lock()) flushes this workqueue, which leads to a deadlock
if the join task is pending.

Fix this by simply not flushing the workqueue from ipoib_stop(). It
turns out that we really don't care about workqueue tasks running
during or after ipoib_stop(), as long as we make sure to flush the
workqueue before unregistering a netdev.

This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1114>.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# b9e40857 15-Jul-2008 David S. Miller <davem@davemloft.net>

netdev: Do not use TX lock to protect address lists.

Now that we have a specific lock to protect the network
device unicast and multicast lists, remove extraneous
grabs of the TX lock in cases where the code only needs
address list protection.

Signed-off-by: David S. Miller <davem@davemloft.net>


# e308a5d8 15-Jul-2008 David S. Miller <davem@davemloft.net>

netdev: Add netdev->addr_list_lock protection.

Add netif_addr_{lock,unlock}{,_bh}() helpers.

Use them to protect operations that operate on or read
the network device unicast and multicast address lists.

Also use them in cases where the code simply wants to
block calls into the driver's ->set_rx_mode() and
->set_multicast_list() methods.

Signed-off-by: David S. Miller <davem@davemloft.net>


# bd360671 15-Jul-2008 Eli Cohen <eli@mellanox.co.il>

IPoIB: Use dev_set_mtu() to change mtu

When the driver sets the MTU of the net device outside of its
change_mtu method, it should make use of dev_set_mtu() instead of
directly setting the mtu field of struct netdevice. Otherwise
functions registered to be called upon MTU change will not get called
(this is done through call_netdevice_notifiers() in dev_set_mtu()).

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# c8c2afe3 15-Jul-2008 Eli Cohen <eli@mellanox.co.il>

IPoIB: Use rtnl lock/unlock when changing device flags

Use of this lock is required to synchronize changes to the netdevice's
data structs. Also move the call to ipoib_flush_paths() after the
modification of the netdevice flags in set_mode().

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 9eae554c 15-Jul-2008 Roland Dreier <rolandd@cisco.com>

IPoIB: Get rid of ipoib_mcast_detach() wrapper

ipoib_mcast_detach() does nothing except call ib_detach_mcast(), so just
use the core API in the one place that does a multicast group detach.

add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-105 (-105)
function old new delta
ipoib_mcast_leave 357 319 -38
ipoib_mcast_detach 67 - -67

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# d0de1362 15-Jul-2008 Eli Cohen <eli@mellanox.co.il>

IPoIB: Only set Q_Key once: after joining broadcast group

The current code will set the Q_Key for any join of a non-sendonly
multicast group. The operation involves a modify QP operation, which
is fairly heavyweight, and is only really required after the join of
the broadcast group. Fix this by adding a parameter to ipoib_mcast_attach()
to control when the Q_Key is set.
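
The effect of the new parameter can be illustrated with a small model.
The sketch below is not the driver's ipoib_mcast_attach(); struct
qp_model and mcast_attach() are hypothetical, and the heavyweight
modify-QP step is reduced to a counter.

```c
struct qp_model {
	unsigned int qkey;
	int modify_calls;	/* counts the expensive modify-QP operations */
	int attached;		/* counts multicast group attachments */
};

/*
 * Attach the QP to a group. The Q_Key is programmed only when the
 * caller asks for it, i.e. for the broadcast group join.
 */
static int mcast_attach(struct qp_model *qp, unsigned int qkey, int set_qkey)
{
	if (set_qkey) {
		qp->qkey = qkey;
		qp->modify_calls++;
	}
	qp->attached++;
	return 0;
}
```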

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# c03d4731 15-Jul-2008 Eli Cohen <eli@mellanox.co.il>

IPoIB: Remove unused IPOIB_MCAST_STARTED code

The IPOIB_MCAST_STARTED flag is not used at all since commit b3e2749b
("IPoIB: Don't drop multicast sends when they can be queued"), so
remove it.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# f3781d2e 15-Jul-2008 Roland Dreier <rolandd@cisco.com>

RDMA: Remove subversion $Id tags

They don't get updated by git and so they're worse than useless.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# e1d50dce 20-May-2008 Jack Morgenstein <jackm@dev.mellanox.co.il>

IPoIB: Test for NULL broadcast object in ipoib_mcast_join_finish()

We saw a kernel oops in our regression testing when a multicast "join
finish" occurred just after the interface was -- this is
<https://bugs.openfabrics.org/show_bug.cgi?id=1040>. The test
randomly causes the HCA physical port to go down then up.

The cause of this is that ipoib_mcast_join_finish() processing happens
just after ipoib_mcast_dev_flush() was invoked (in which case the
broadcast pointer is NULL). This patch tests for and handles the case
where priv->broadcast is NULL.

Cc: <stable@kernel.org>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# bc7b3a36 23-Apr-2008 Shirley Ma <mashirle@us.ibm.com>

IPoIB: Handle 4K IB MTU for UD (datagram) mode

This patch enables IPoIB to use 4K UD messages (when the underlying
device and fabrics support a 4K MTU) by using two scatter buffers when
PAGE_SIZE is less than or equal to the HCA IB MTU size. The first
buffer is for IPoIB header + GRH header, and the second buffer is the
IPoIB payload, which is 4K-4.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# b3e2749b 11-Mar-2008 Or Gerlitz <ogerlitz@voltaire.com>

IPoIB: Don't drop multicast sends when they can be queued

When set_multicast_list() is called the multicast task is restarted
and the IPOIB_MCAST_STARTED bit is cleared. As a result for some
window of time, multicast packets are not transmitted nor queued but
rather dropped by ipoib_mcast_send(). These dropped packets are
painful in two cases:

- bonding fail-over which both calls set_multicast_list() on the new
active slave and sends Gratuitous ARP through that slave.

- IP_DROP_MEMBERSHIP code which both calls set_multicast_list() on the
device and issues IGMP leave.

In both these cases, depending on the scheduling of the IPoIB
multicast task, the packets would be dropped. As a result, in the
bonding case, the failover would not be detected by the peers until
their neighbour entry is renewed (which takes a few tens of
seconds). In the IGMP case, the IP router doesn't get an IGMP leave
and would only learn of that from further probes on the group (also a
delay of at least a few tens of seconds).

Fix this by allowing transmission (or queuing) depending on the
IPOIB_FLAG_OPER_UP flag instead of the IPOIB_MCAST_STARTED flag.

Signed-off-by: Olga Shern <olgas@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# a9e527e3 10-Dec-2007 Rolf Manderscheid <rvm@obsidianresearch.com>

IPoIB: improve IPv4/IPv6 to IB mcast mapping functions

An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs. The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised. This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.

The next step will be to add a way to configure the scope for an IPoIB
interface.
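
A user-space sketch of an IPv4 mapping routine that takes the broadcast
address as a parameter, as described above. The byte layout follows my
understanding of the IPoIB multicast address format (RFC 4391) and
should be treated as an assumption; ipv4_to_ib_mcast() is a hypothetical
name, and addr is expected in host byte order.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_ALEN 20	/* assumed link-layer address length */

static void ipv4_to_ib_mcast(uint32_t addr, const unsigned char *broadcast,
			     unsigned char *buf)
{
	/* scope comes from the broadcast GID instead of being hard-coded */
	unsigned char scope = broadcast[5] & 0x0f;

	memset(buf, 0, IPOIB_ALEN);
	buf[1] = 0xff;			/* multicast QPN */
	buf[2] = 0xff;
	buf[3] = 0xff;
	buf[4] = 0xff;			/* multicast GID prefix */
	buf[5] = 0x10 | scope;		/* flags + scope */
	buf[6] = 0x40;			/* IPv4 signature */
	buf[7] = 0x1b;
	buf[8] = broadcast[8];		/* P_Key from the broadcast address */
	buf[9] = broadcast[9];
	buf[16] = (addr >> 24) & 0x0f;	/* low 28 bits of the group address */
	buf[17] = (addr >> 16) & 0xff;
	buf[18] = (addr >> 8) & 0xff;
	buf[19] = addr & 0xff;
}
```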

Signed-off-by: Rolf Manderscheid <rvm@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 2337f809 23-Oct-2007 Roland Dreier <rolandd@cisco.com>

IPoIB: Trivial formatting cleanups

Fix whitespace blunders, convert "foo* bar" to "foo *bar", etc.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 732a2170 09-Oct-2007 Moni Shoua <monis@voltaire.com>

IB/ipoib: Bound the net device to the ipoib_neigh structure

IPoIB uses a two layer neighboring scheme, such that for each struct neighbour
whose device is an ipoib one, there is a struct ipoib_neigh buddy which is
created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour)
call.

When using the bonding driver, neighbours are created by the net stack on behalf
of the bonding (master) device. On the tx flow the bonding code gets an skb such
that skb->dev points to the master device, it changes this skb to point to the
slave device and calls the slave hard_start_xmit function.

Under this scheme, the ipoib_neigh_destructor assumption that for each struct
neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev)
can be cast to struct ipoib_dev_priv is buggy.

To fix it, this patch adds a dev field to struct ipoib_neigh which is used
instead of the struct neighbour dev one, when n->dev->flags has the
IFF_MASTER bit set.

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
Acked-by: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>


# de903512 28-Sep-2007 Roland Dreier <rolandd@cisco.com>

[IPoIB]: Convert to netdevice internal stats

Use the stats member of struct netdevice in IPoIB, so we can save
memory by deleting the stats member of struct ipoib_dev_priv, and save
code by deleting ipoib_get_stats().

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 335a64a5a 08-Oct-2007 Or Gerlitz <ogerlitz@voltaire.com>

IPoIB: Allow setting policy to ignore multicast groups

The kernel IB stack allows (through the RDMA CM) userspace
applications to join and use multicast groups from the IPoIB MGID
range. This allows multicast traffic to be handled directly from
userspace QPs, without going through the kernel stack, which gives
better performance for some applications.

However, to fully interoperate with IP multicast, such userspace
applications need to participate in IGMP reports and queries, or else
routers may not forward the multicast traffic to the system where the
application is running. The simplest way to do this is to share the
kernel IGMP implementation by using the IP_ADD_MEMBERSHIP option to
join multicast groups that are being handled directly in userspace.

However, in such cases, the actual multicast traffic should not also
be handled by the IPoIB interface, because that would burn resources
handling multicast packets that will just be discarded in the kernel.

To handle this, this patch adds lookup on the database used for IB
multicast group reference counting when IPoIB is joining multicast
groups, and if a multicast group is already handled by user space,
then the IPoIB kernel driver ignores the group. This is controlled by
a per-interface policy flag. When the flag is set, IPoIB will not
join and attach its QP to a multicast group which already has an entry
in the database; when the flag is cleared, IPoIB will behave as before
this change.

For each IPoIB interface, the /sys/class/net/$intf/umcast attribute
controls the policy flag. The default value is off/0.

Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
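
The policy described above reduces to a two-input decision. The following is
a minimal userspace sketch of that decision only; the function name and its
boolean parameters are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether the interface should join (and
 * attach its QP to) a multicast group.
 *   umcast_flag : the per-interface policy flag (default off)
 *   group_in_db : the group already has an entry in the reference
 *                 counting database, i.e. userspace already handles it
 */
bool sketch_should_join(bool umcast_flag, bool group_in_db)
{
    /* Only skip the join when the policy is on AND the group is taken. */
    return !(umcast_flag && group_in_db);
}
```

With the flag cleared the function always returns true, which matches the
"behave as before this change" case in the message.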


# 81668838 02-Aug-2007 Sean Hefty <sean.hefty@intel.com>

IPoIB: Specify Traffic Class with path record queries for QoS support

To support QoS within and between subnets, modify IPoIB to request
specific Traffic Class values with path record queries, using
the value associated with the IPoIB broadcast group.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ See some comments I made on this at v1 and v2 of the posts
<http://lists.openfabrics.org/pipermail/general/2007-August/039275.html>
<http://lists.openfabrics.org/pipermail/general/2007-September/040312.html> ]

Reviewed-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 24bd1e4e 18-May-2007 Michael S. Tsirkin <mst@dev.mellanox.co.il>

IB/ipoib: Fix typos in error messages

Trivial error message fixups.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# e07832b6 19-Mar-2007 Sean Hefty <sean.hefty@intel.com>

IPoIB: Fix race in detaching from mcast group before attaching

There's a race between ipoib_mcast_leave() and ipoib_mcast_join_finish()
where we can try to detach from a multicast group before we've
attached to it. Fix this by reordering the code in ipoib_mcast_leave
to free the multicast group first, which waits for the multicast
callback thread (which calls ipoib_mcast_join_finish()) to complete
before detaching from the group.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 55c9adde 08-Mar-2007 Shirley Ma <xma@us.ibm.com>

IPoIB: Turn on interface's carrier after broadcast group is joined

Do netif_carrier_on() right after the IPv4 broadcast multicast group
is joined, rather than waiting for all of the initial set of multicast
group joins to finish. This allows at least IPv4 traffic to limp
along on broken fabrics where not all multicast groups can be joined.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 658bcef6 21-Feb-2007 Roland Dreier <rolandd@cisco.com>

IPoIB: Remove unused local_rate tracking

Now that low-level drivers handle the conversion from an absolute rate
to a relative rate, there's no need for the IPoIB driver to keep track
of the local port's data rate.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# faec2f7b 15-Feb-2007 Sean Hefty <sean.hefty@intel.com>

IB/sa: Track multicast join/leave requests

The IB SA tracks multicast join/leave requests on a per port basis and
does not do any reference counting: if two users of the same port join
the same group, and one leaves that group, then the SA will remove the
port from the group even though the other user still wants to remain a
member. Therefore, in order to support multiple users of the
same multicast group from the same port, we need to perform reference
counting locally.

To do this, add a multicast submodule to ib_sa to perform reference
counting of multicast join/leave operations. Modify ib_ipoib (the
only in-kernel user of multicast) to use the new interface.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
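
The local reference counting idea can be illustrated with a small userspace
sketch. All names here are hypothetical; the counters sa_joins and sa_leaves
stand in for requests that would actually be sent to the SA, and the real
ib_sa interface is not shown.

```c
#include <assert.h>

/* Hypothetical per-port, per-group bookkeeping. */
struct sketch_group {
    int refcount;   /* local users of this group on this port */
    int sa_joins;   /* join requests that would reach the SA */
    int sa_leaves;  /* leave requests that would reach the SA */
};

/* Only the first local user triggers a real join. */
void sketch_join(struct sketch_group *g)
{
    if (g->refcount++ == 0)
        g->sa_joins++;
}

/* Only the last local user triggers a real leave. */
void sketch_leave(struct sketch_group *g)
{
    assert(g->refcount > 0);
    if (--g->refcount == 0)
        g->sa_leaves++;
}

/* Two users join, one leaves: the port must still be a member. */
int sketch_refcount_demo(void)
{
    struct sketch_group g = { 0, 0, 0 };
    int still_member;

    sketch_join(&g);
    sketch_join(&g);
    sketch_leave(&g);
    still_member = (g.sa_leaves == 0);
    sketch_leave(&g);
    return still_member && g.sa_joins == 1 && g.sa_leaves == 1;
}
```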


# 839fcaba 05-Feb-2007 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Connected mode experimental support

The following patch adds experimental support for IPoIB connected
mode, as defined by the draft from the IETF ipoib working group. The
idea is to increase performance by increasing the MTU from the maximum
of 2K (theoretically 4K) supported by IPoIB on top of UD. With this
code, I'm able to get 800MByte/sec or more with netperf without
options on a Mellanox 4x back-to-back DDR system.

Some notes on code:
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction, so
each connection is either active(TX) or passive (RX). 2 sides that
want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
this keeps the code simple without CQ resize and other tricks
6. To detect stale passive side connections (where the remote side is
down), we keep an LRU list of passive connections (updated once per
second per connection) and destroy a connection after it has been
unused for several seconds. The LRU rule makes it possible to avoid
scanning connections that have recently been active.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 2745b5b7 16-Nov-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Fix skb leak when freeing neighbour

ipoib_neigh_free() is sometimes called while neighbour is still alive,
so it might still have queued skbs. Fix skb leak in this case.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# c4028958 22-Nov-2006 David Howells <dhowells@redhat.com>

WorkStruct: make allyesconfig

Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>


# d0df6d6d 22-Sep-2006 Roland Dreier <rolandd@cisco.com>

IPoIB: Create MCGs with all attributes required by RFC

RFC 4391 ("Transmission of IP over InfiniBand (IPoIB)") says:

If the IB multicast group does not already exist, one must be
created first with the IPoIB link MTU. The MGID MUST use the same
P_Key, Q_Key, SL, MTU, and HopLimit as those used in the
broadcast-GID. The rest of attributes SHOULD follow the values used
in the broadcast-GID as well.

However, the current IPoIB driver is only setting the attributes
required by the InfiniBand spec to create a multicast group, so in
particular the MTU and HopLimit are not being set. Add these
attributes when creating MCGs, and also set the Rate attribute, since
IPoIB pays attention to that attribute as well.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# c1a0b23b 21-Aug-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IB/sa: Require SA registration

Require users to register with SA module, to prevent the sa_query
module text from going away while an SA query callback is still
running. Update all in-tree users for the new interface.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 3cd96564 22-Sep-2006 Roland Dreier <rolandd@cisco.com>

IB: Whitespace fixes

Remove some trailing whitespace that has snuck in despite the best
efforts of whitespace=error-all. Also fix a few other whitespace
bogosities.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# c11bd42a 14-Sep-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Retry failed send-only multicast group joins

When a send-only multicast group join fails, mcast->query must be set
to NULL. Otherwise, IPoIB will never retry the join and the multicast
group will never be reachable.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 624d01f8 24-Jul-2006 Or Gerlitz <ogerlitz@voltaire.com>

IB/ipoib: Fix oops with ipoib_debug_mcast set

Need to set mcast->ah before debug code dereferences it.

Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 179e0917 26-Jun-2006 Akinobu Mita <mita@miraclelinux.com>

[PATCH] drivers: use list_move()

This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B) under drivers/.

Acked-by: Corey Minyard <minyard@mvista.com>
Cc: Ben Collins <bcollins@debian.org>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Alasdair Kergon <dm-devel@redhat.com>
Cc: Gerd Knorr <kraxel@bytesex.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Pavlic <fpavlic@de.ibm.com>
Acked-by: Matthew Wilcox <matthew@wil.cx>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
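
To show why the two forms are interchangeable, here is a minimal userspace
sketch of a circular doubly linked list. It is modelled loosely on the
kernel list helpers, but the sketch_ names and the struct are illustrative
stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked list node. */
struct node {
    struct node *next, *prev;
};

void sketch_list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Unlink an entry from whatever list it is currently on. */
void sketch_list_del(struct node *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Insert an entry right after head. */
void sketch_list_add(struct node *entry, struct node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* One call that replaces the del + add pair. */
void sketch_list_move(struct node *entry, struct node *head)
{
    assert(entry != head);
    sketch_list_del(entry);
    sketch_list_add(entry, head);
}

/* Returns 1 if moving the only entry of list a empties a and fills b. */
int sketch_move_demo(void)
{
    struct node a, b, item;

    sketch_list_init(&a);
    sketch_list_init(&b);
    sketch_list_add(&item, &a);
    sketch_list_move(&item, &b);
    return a.next == &a && a.prev == &a &&
           b.next == &item && b.prev == &item &&
           item.next == &b && item.prev == &b;
}
```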


# 932ff279 09-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au>

[NET]: Add netif_tx_lock

Various drivers use xmit_lock internally to synchronise with their
transmission routines. They do so without setting xmit_lock_owner.
This is fine as long as netpoll is not in use.

With netpoll it is possible for deadlocks to occur if xmit_lock_owner
isn't set. This is because a printk that occurs while xmit_lock is held
and xmit_lock_owner is not set can cause netpoll to attempt to take
xmit_lock recursively.

While it is possible to resolve this by getting netpoll to use
trylock, it is suboptimal because netpoll's sole objective is to
maximise the chance of getting the printk out on the wire. So
delaying or dropping the message is to be avoided as much as possible.

So the only alternative is to always set xmit_lock_owner. The
following patch does this by introducing the netif_tx_lock family of
functions that take care of setting/unsetting xmit_lock_owner.

I renamed xmit_lock to _xmit_lock to indicate that it should not be
used directly. I didn't provide irq versions of the netif_tx_lock
functions since xmit_lock is meant to be a BH-disabling lock.

This is pretty much a straight text substitution except for a small
bug fix in winbond. It currently uses
netif_stop_queue/spin_unlock_wait to stop transmission. This is
unsafe as an IRQ can potentially wake up the queue. So it is safer to
use netif_tx_disable.

The hamradio bits used spin_lock_irq but it is unnecessary as
xmit_lock must never be taken in an IRQ handler.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 37c22a77 29-May-2006 Jack Morgenstein <jackm@mellanox.co.il>

IPoIB: Fix kernel unaligned access on ia64

Fix misaligned access faults on ia64: never cast a misaligned
neighbour->ha + 4 pointer to union ib_gid type; pass a void * pointer
instead. The memcpy was being optimized to use full word accesses
because the compiler thought that union ib_gid is always aligned.

The cast in IPOIB_GID_ARG is safe, since it is fixed to access each
byte separately.

Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
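
The pattern described above, passing an untyped pointer so that the copy
makes no alignment assumption about its source, can be sketched in
userspace as follows. The union and function names are hypothetical
stand-ins, and the 20-byte buffer with the GID at offset 4 is an assumption
made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 16-byte GID with an 8-byte-aligned view. */
union sketch_gid {
    uint8_t raw[16];
    struct {
        uint64_t subnet_prefix;
        uint64_t interface_id;
    } global;
};

/*
 * The source is a const void *, so nothing here promises the compiler
 * that src is aligned like union sketch_gid; memcpy must then cope with
 * any alignment. Casting src to (union sketch_gid *) first would make
 * that promise and allow full-word loads from a misaligned address.
 */
void sketch_copy_gid(union sketch_gid *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(*dst));
}

/* Copy a GID out of a 20-byte address buffer, starting at offset 4. */
int sketch_gid_demo(void)
{
    uint8_t ha[20];
    union sketch_gid gid;
    int i;

    for (i = 0; i < 20; i++)
        ha[i] = (uint8_t)i;
    sketch_copy_gid(&gid, ha + 4);
    return gid.raw[0] == 4 && gid.raw[15] == 19;
}
```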


# f2de3b06 05-Apr-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Wait for join to finish before freeing mcast struct

ipoib_mcast_restart_task() might free an mcast object while a join
request is still outstanding, leading to an oops when the query
completes. Fix this by waiting for query to complete, similar to what
ipoib_stop_thread() is doing. The wait for mcast completion code is
consolidated in wait_for_mcast_join().

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# bf6a9e31 10-Apr-2006 Jack Morgenstein <jackm@mellanox.co.il>

IB: simplify static rate encoding

Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:

- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.

Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# d2e0655e 04-Apr-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Consolidate private neighbour data handling

Consolidate IPoIB's private neighbour data handling into
ipoib_neigh_alloc() and ipoib_neigh_free(). This will make it easier
to keep track of the neighbour structures that IPoIB is handling, and
is a nice cleanup of the code:

add/remove: 2/1 grow/shrink: 1/8 up/down: 100/-178 (-78)
function                        old     new   delta
ipoib_neigh_alloc                 -      61     +61
ipoib_neigh_free                  -      36     +36
ipoib_mcast_join_finish        1288    1291      +3
path_rec_completion             575     573      -2
ipoib_mcast_join_task           664     660      -4
ipoib_neigh_destructor          101      92      -9
ipoib_neigh_setup_dev            14       3     -11
ipoib_neigh_setup                17       -     -17
path_free                       238     215     -23
ipoib_mcast_free                329     306     -23
ipoib_mcast_send                718     684     -34
neigh_add_path                  705     650     -55

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# e35fc385 20-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com>

[INFINIBAND] ipoib: Remove leftover use of neigh_ops->destructor

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8b9ab02b 07-Mar-2006 Roland Dreier <rdreier@cisco.com>

IPoIB: Fix build now that neighbour destructor is in neigh_params

Fix the IPoIB build (which is broken in net-2.6.17 because of my
screw-up, which left out this chunk in ipoib_multicast.c).
The neighbour destructor is now in neigh_params, so we don't
need to clear it in the ops structure.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 9acf6a85 02-Mar-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Fix multicast race between canceling and completing

ipoib_mcast_stop_thread currently tests mcast->query and if it is
NULL, does not perform wait_for_completion on the mcast and frees the
mcast object directly.

However, since both operations are done without locking, it is
possible that ipoib_mcast_join_complete is in progress on this mcast
object and has set mcast->query to NULL already.

Solve this by:
- taking priv->lock before we change mcast->query in ipoib_mcast_join_complete,
and keeping it until we no longer need the mcast object
- taking priv->lock around mcast->query test in ipoib_mcast_stop_thread

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 7343b231 27-Feb-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Close race in setting mcast->ah

ipoib_mcast_send() tests mcast->ah twice. If this value is changed
between these two points, we leak an skb. However,
ipoib_mcast_join_finish() sets mcast->ah with no locking, so it could
race against ipoib_mcast_send().

As a solution, take priv->lock around assignment to mcast->ah thus
making sure ipoib_mcast_send() (which also takes priv->lock) is not in
flight.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 20b83382 11-Feb-2006 Roland Dreier <rolandd@cisco.com>

IPoIB: Yet another fix for send-only joins

Even after the last fix, it's still possible for a send-only join to
start before the join for the broadcast group has finished. This
could cause us to create a multicast group using attributes from the
broadcast group that haven't been initialized yet, so we would use
garbage for the Q_Key, etc. Fix this by waiting until the broadcast
group's attached flag is set before starting send-only joins.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 7bcb974e 07-Feb-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Fix another send-only join race

Further, there's an additional issue that I saw in testing:
ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if
the device was downed and then upped internally because of a port
event).

If this happens and the send-only join request gets completed before
priv->broadcast is set, we get an oops.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 479a0796 07-Feb-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Don't start send-only joins while multicast thread is stopped

Fix the following race scenario:
- Device is up.
- Port event or set mcast list triggers ipoib_mcast_stop_thread,
this cancels the query and waits on mcast "done" completion.
- Completion is called and "done" is set.
- Meanwhile, ipoib_mcast_send arrives and starts a new query,
re-initializing "done".

Fix this by adding a "multicast started" bit and checking it before
starting a send-only join.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# b36f170b6 17-Jan-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Lock accesses to multicast packet queues

Avoid corrupting mcast->pkt_queue by serializing access with
priv->tx_lock. Also, update dropped packet statistics to count
multicast packets removed from pkt_queue as dropped.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 95ed644f 13-Jan-2006 Ingo Molnar <mingo@elte.hu>

IB: convert from semaphores to mutexes

semaphore to mutex conversion by Ingo and Arjan's script.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Sanity-checked on real IB hardware ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 988bd503 12-Jan-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Fix memory leak of multicast group structures

The current handling of multicast groups in IPoIB ends up never
freeing send-only multicast groups. It turns out the logic was much
more complicated than it needed to be; we can fix this bug and
completely kill ipoib_mcast_dev_down() at the same time.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 78bfe0b5 11-Jan-2006 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: Take dev->xmit_lock around mc_list accesses

dev->mc_list accesses must be protected by dev->xmit_lock.
Found by Eli Cohen <eli@mellanox.co.il>.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 97460df3 10-Jan-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Fix address handle refcounting for multicast groups

Multiple ipoib_neigh structures on mcast->neigh_list may point to the
same ah. This means that ipoib_mcast_free() can't just make a list of
ah structs to free, since this might end up trying to add the same ah
to the list more than once. Handle this in ipoib_multicast.c in the
same way as it is handled in ipoib_main.c for struct ipoib_path.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 70b4c8cd 10-Jan-2006 Eli Cohen <eli@mellanox.co.il>

IPoIB: Fix error path in ipoib_mcast_dev_flush()

Don't leak memory on allocation failure for broadcast mcast group.
Also, print a warning to match handling for other mcast groups.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 14c85021 26-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com>

[INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h

To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h included twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2e86541e 29-Nov-2005 Roland Dreier <rolandd@cisco.com>

IPoIB: don't zero members after we allocate with kzalloc

ipoib_mcast_alloc() uses kzalloc(), so there's no need to zero out
members of the mcast struct after it's allocated.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# de922487 29-Nov-2005 Michael S. Tsirkin <mst@mellanox.co.il>

IPoIB: reinitialize mcast structs' completions for every query

Make sure mcast->done is initialized to uncompleted value before we
submit a new query, so that it's safe to wait on.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 8c608a32 07-Nov-2005 Roland Dreier <rolandd@cisco.com>

[IPoIB] no need to set skb->dev right before freeing skb

For cut-and-paste reasons, the IPoIB driver was setting skb->dev right
before calling dev_kfree_skb_any(). Get rid of this.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 1732b0ef 07-Nov-2005 Roland Dreier <rolandd@cisco.com>

[IPoIB] add path record information in debugfs

Add ibX_path files to debugfs that contain information about the IPoIB
path cache. IPoIB ARP only gives GIDs, which the IPoIB driver must
resolve to real IB paths through the ib_sa module. For debugging,
when the ARP table looks OK but traffic isn't flowing, it's useful to
be able to see if the resolution from GID to path worked.

Also clean up the formatting of the existing _mcg debugfs files.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 8ae5a8a2 02-Nov-2005 Roland Dreier <rolandd@cisco.com>

[IPoIB] don't compile debug code if debugging isn't enabled

Don't build ipoib_mcast_iter_ functions if CONFIG_INFINIBAND_IPOIB_DEBUG
is not enabled -- their only callers will not be built either.

Also move the prototype for ipoib_open() to ipoib.h to fix a sparse warning.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# de6eb66b 02-Nov-2005 Roland Dreier <rolandd@cisco.com>

[IB] kzalloc() conversions

Replace kmalloc()+memset(,0,) with kzalloc(), for a net savings of 35
source lines and about 500 bytes of text.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 8d2cae06 20-Sep-2005 Roland Dreier <rolandd@cisco.com>

[PATCH] IPoIB: Don't flush workqueue from within workqueue

ipoib_mcast_restart_task() is always called from within the
single-threaded IPoIB workqueue, so flushing the workqueue from within
the function can lead to a recursion overflow. But since we're
running in a single-threaded workqueue, we're already synchronized
against other items in the workqueue, so just get rid of the flush in
ipoib_mcast_restart_task().

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# ce5b65cc 18-Sep-2005 Hal Rosenstock <halr@voltaire.com>

[PATCH] IPoIB: Fix SA client retransmission strategy

We got a little mixed up with what the backoff member holds in the
IPoIB multicast group structure: sometimes it was used as a number of
seconds, and sometimes it was used as a number of jiffies. Fix the
code so that backoff is always in seconds.

Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
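
The unit convention can be sketched as a single conversion point: the stored
backoff is always seconds, and it is turned into ticks only where a delay is
scheduled. The names below are hypothetical, and SKETCH_HZ is an assumed
tick rate chosen for illustration (the kernel's HZ is configuration
dependent).

```c
#include <assert.h>

#define SKETCH_HZ 250UL  /* assumed ticks per second, illustration only */

/* Convert a backoff held in seconds into a delay in ticks. */
unsigned long sketch_secs_to_jiffies(unsigned int backoff_secs)
{
    unsigned long ticks = backoff_secs * SKETCH_HZ;

    /* Guard against wraparound for absurdly large backoff values. */
    assert(backoff_secs == 0 || ticks / SKETCH_HZ == backoff_secs);
    return ticks;
}
```

Treating a value stored in ticks as seconds (or the reverse) would scale the
retry delay by a factor of the tick rate, which is the kind of mix-up the
message describes.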


# 97f52eb4 13-Aug-2005 Sean Hefty <sean.hefty@intel.com>

[PATCH] IB: sparse endianness cleanup

Fix sparse warnings. Use __be* where appropriate.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>


# 2a1d9b7f 11-Aug-2005 Roland Dreier <roland@eddore.topspincom.com>

[PATCH] IB: Add copyright notices

Make some lawyers happy and add copyright notices for people who
forgot to include them when they actually touched the code.

Signed-off-by: Roland Dreier <rolandd@cisco.com>


# e6ded99c 16-Apr-2005 Roland Dreier <roland@topspin.com>

[PATCH] IPoIB: fix static rate calculation

Correct and simplify calculation of static rate. We need to round up the
quotient of (local_rate - path_rate) / path_rate. To round up we add
(path_rate - 1) to the numerator, so the quotient simplifies to (local_rate -
1) / path_rate.

No idea how I came up with the old formula.

Signed-off-by: Roland Dreier <roland@topspin.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
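
The rounding identity in the message can be checked with a small userspace
sketch. The function names are hypothetical, and the rates are treated as
plain positive integers in the same (unspecified) unit, which is an
assumption made for illustration.

```c
#include <assert.h>

/* Round-up written out in full: add (path_rate - 1) to the numerator. */
int sketch_static_rate_long(int local_rate, int path_rate)
{
    assert(path_rate > 0 && local_rate >= path_rate);
    return ((local_rate - path_rate) + (path_rate - 1)) / path_rate;
}

/* Simplified form: the path_rate terms cancel in the numerator. */
int sketch_static_rate(int local_rate, int path_rate)
{
    assert(path_rate > 0 && local_rate >= path_rate);
    return (local_rate - 1) / path_rate;
}
```

For example, with local_rate = 10 and path_rate = 4 the exact quotient is
6 / 4 = 1.5, and both forms round it up to 2.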


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!