History log of /linux-master/drivers/net/ethernet/mellanox/mlx4/en_cq.c
Revision Date Author Comments
# b48b89f9 27-Sep-2022 Jakub Kicinski <kuba@kernel.org>

net: drop the weight argument from netif_napi_add

We tell driver developers to always pass NAPI_POLL_WEIGHT
as the weight to netif_napi_add(). This may be confusing
to newcomers, drop the weight argument, those who really
need to tweak the weight can use netif_napi_add_weight().

Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN
Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 16d083e2 04-May-2022 Jakub Kicinski <kuba@kernel.org>

net: switch to netif_napi_add_tx()

Switch net callers to the new API not requiring
the NAPI_POLL_WEIGHT argument.

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://lore.kernel.org/r/20220504163725.550782-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 197d2370 10-Dec-2020 Thomas Gleixner <tglx@linutronix.de>

net/mlx4: Use effective interrupt affinity

Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.

The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.672935978@linutronix.de


# 80a62dee 10-Dec-2020 Thomas Gleixner <tglx@linutronix.de>

net/mlx4: Replace irq_to_desc() abuse

No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.

Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.580936243@linutronix.de


# 4beaacc6 13-Dec-2018 Eric Dumazet <edumazet@google.com>

net/mlx4_en: remove fallback after kzalloc_node()

kzalloc_node(..., GFP_KERNEL, node) will attempt to allocate
memory as close as possible to the node.

There is no need to fallback to kzalloc() if this has failed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e4567897 21-Nov-2018 Daniel Jurgens <danielj@mellanox.com>

{net, IB}/mlx4: Initialize CQ buffers in the driver when possible

Perform CQ initialization in the driver when the capability is supported
by the FW. When passing the CQ to HW indicate that the CQ buffer has
been pre-initialized.

Doing so decreases CQ creation time. Testing on P8 showed a single 2048
entry CQ creation time was reduced from ~395us to ~170us, which is
2.3x faster.

Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f3eebe88 23-Jul-2017 Zhu Yanjun <yanjun.zhu@oracle.com>

mlx4_en: remove unnecessary returned value

The function mlx4_en_arm_cq always returns zero. So change the
return type of the function mlx4_en_arm_cq to void.

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f3301870 21-Jun-2017 Moshe Shemesh <moshe@mellanox.com>

(IB, net)/mlx4: Add resource utilization support

Adding visibility of resource usage of QPs, CQs and counters used by
virtual functions. This feature will be used to give the PF administrator
more data while debugging VF status. Usage info was added to ALLOC_RES
command, to notify the PF if the resource which is being reserved or
allocated for the VF will be used by kernel driver or by user verbs.

Updated reservation and allocation functions of QP, CQ and counter with
additional usage parameter.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>


# 6c78511b 15-Jun-2017 Tariq Toukan <tariqt@mellanox.com>

net/mlx4_en: Poll XDP TX completion queue in RX NAPI

Instead of having their own NAPIs, XDP TX completion queues get
polled within the corresponding RX NAPI.
This prevents any possible race on TX ring prod/cons indices,
between the context that issues the transmits (RX NAPI) and the
context that handles the completions (was previously done in
a separate NAPI).

This also improves performance, as it decreases the number
of NAPIs running on a CPU, saving the overhead of syncing
and switching between the contexts.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
| Before | After | Gain |
IPv4 | 12.0 Mpps | 13.8 Mpps | 15% |
IPv6 | 12.0 Mpps | 13.8 Mpps | 15% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bb07fafa 16-Nov-2016 Eric Dumazet <edumazet@google.com>

net/mlx4_en: remove napi_hash_del() call

There is no need calling napi_hash_del()+synchronize_rcu() before
calling netif_napi_del()

netif_napi_del() does this already.

Using napi_hash_del() in a driver is useful only when dealing with
a batch of NAPI structures, so that a single synchronize_rcu() can
be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 67f8b1dc 02-Nov-2016 Tariq Toukan <tariqt@mellanox.com>

net/mlx4_en: Refactor the XDP forwarding rings scheme

Separately manage the two types of TX rings: regular ones, and XDP.
Upon an XDP set, do not borrow regular TX rings and convert them
into XDP ones, but allocate new ones, unless we hit the max number
of rings.
Which means that in systems with smaller #cores we will not consume
the current TX rings for XDP, while we are still in the num TX limit.

XDP TX rings counters are not shown in ethtool statistics.
Instead, XDP counters will be added to the respective RX rings
in a downstream patch.

This has no performance implications.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ccc109b8 02-Nov-2016 Tariq Toukan <tariqt@mellanox.com>

net/mlx4_en: Add TX_XDP for CQ types

Support XDP CQ type, and refactor the CQ type enum.
Rename the is_tx field to match the change.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 958b3d39 13-Oct-2016 Brenden Blanco <bblanco@plumgrid.com>

net/mlx4_en: fixup xdp tx irq to match rx

In cases where the number of tx rings is not a multiple of the number of
rx rings, the tx completion event will be handled on a different core
from the transmit and population of the ring. Races on the ring will
lead to a double-free of the page, and possibly other corruption.

The rings are initialized by default with a valid multiple of rings,
based on the number of cpus, therefore an invalid configuration requires
ethtool to change the ring layout. For instance 'ethtool -L eth0 rx 9 tx
8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
completed on cpu3 (48 % 9 == 3).

Resolve this discrepancy by shifting the irq for the xdp tx queues to
start again from 0, modulo rx_ring_num.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 73898db0 04-May-2016 Haggai Abramovsky <hagaya@mellanox.com>

net/mlx4: Avoid wrong virtual mappings

The dma_alloc_coherent() function returns a virtual address which can
be used for coherent access to the underlying memory. On some
architectures, like arm64, undefined behavior results if this memory is
also accessed via virtual mappings that are not coherent. Because of
their undefined nature, operations like virt_to_page() return garbage
when passed virtual addresses obtained from dma_alloc_coherent(). Any
subsequent mappings via vmap() of the garbage page values are unusable
and result in bad things like bus errors (synchronous aborts in ARM64
speak).

The mlx4 driver contains code that does the equivalent of:
vmap(virt_to_page(dma_alloc_coherent)), this results in an OOPs when the
device is opened.

Prevent Ethernet driver to run this problematic code by forcing it to
allocate contiguous memory. As for the Infiniband driver, at first we
are trying to allocate contiguous memory, but in case of failure roll
back to work with fragmented memory.

Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reported-by: David Daney <david.daney@cavium.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 93d05d4a 18-Nov-2015 Eric Dumazet <edumazet@google.com>

net: provide generic busy polling to all NAPI drivers

NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d64b5e85 18-Nov-2015 Eric Dumazet <edumazet@google.com>

net: add netif_tx_napi_add()

netif_tx_napi_add() is a variant of netif_napi_add()

It should be used by drivers that use a napi structure
to exclusively poll TX.

We do not want to add this kind of napi in napi_hash[] in following
patches, adding generic busy polling to all NAPI drivers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b0f64463 27-Aug-2015 Carol L Soto <clsoto@linux.vnet.ibm.com>

net/mlx4_core: Fix unintialized variable used in error path

The uninitialized value name in mlx4_en_activate_cq was used in order
to print an error message. Fixing it by replacing it with cq->vector.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# de161803 31-May-2015 Ido Shamay <idos@mellanox.com>

net/mlx4_core: Move affinity hints to mlx4_core ownership

Now that EQs management is in the sole responsibility of mlx4_core,
the IRQ affinity hints configuration should be in its hands as well.
request_irq is called only once by the first consumer (maybe mlx4_ib),
so mlx4_en passes the affinity mask too late. We also need to request
vectors according to the cores we want to run on.

mlx4_core distribution of IRQs to cores is straight forward,
EQ(i)->IRQ will set affinity hint to core i.
Consumers need to request EQ vectors, according to their cores
considerations (NUMA).

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c66fa19c 31-May-2015 Matan Barak <matanb@mellanox.com>

net/mlx4: Add EQ pool

Previously, mlx4_en allocated EQs and used them exclusively.
This affected RoCE performance, as applications which are
events sensitive were limited to use only the legacy EQs.

Change that by introducing an EQ pool. This pool is managed
by mlx4_core. EQs are assigned to ports (when there are limited
number of EQs, multiple ports could be assigned to the same EQs).

An exception to this rule is the ASYNC EQ which handles various events.

Legacy EQs are completely removed as all EQs could be shared.

When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for
EQ serving on a specific port. The core driver calculates which
EQ should be assigned to that request.

Because IRQs are shared between IB and Ethernet modules, their
names only include the PCI device BDF address.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 872bf2fb 25-Jan-2015 Yishai Hadas <yishaih@mellanox.com>

net/mlx4_core: Maintain a persistent memory for mlx4 device

Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 858e6c32 16-Jul-2014 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: cq->irq_desc wasn't set in legacy EQ's

Fix a regression introduced by commit 35f6f45 ("net/mlx4_en: Don't use
irq_affinity_notifier to track changes in IRQ affinity map").
When core is started in legacy EQ's (number of IRQ's < rx rings), cq->irq_desc
was NULL. This caused a kernel crash under heavy traffic - when having more
than rx NAPI budget completions.
Fixed to have it set for both EQ modes.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bb273617 29-Jun-2014 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: IRQ affinity hint is not cleared on port down

Need to remove affinity hint at mlx4_en_deactivate_cq() and not at
mlx4_en_destroy_cq() - since affinity_mask might be free'd while still
being used by procfs.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 35f6f453 29-Jun-2014 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: Don't use irq_affinity_notifier to track changes in IRQ affinity map

IRQ affinity notifier can only have a single notifier - cpu_rmap
notifier. Can't use it to track changes in IRQ affinity map.
Detect IRQ affinity changes by comparing CPU to current IRQ affinity map
during NAPI poll thread.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Hutchings <ben@decadent.org.uk>
Fixes: 2eacc23 ("net/mlx4_core: Enforce irq affinity changes immediatly")
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9e311e77 09-Jun-2014 Yuval Atias <yuvala@mellanox.com>

net/mlx4_en: Use affinity hint

The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it. To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores. If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 96b2e73c 02-Jun-2014 David S. Miller <davem@davemloft.net>

Revert "net/mlx4_en: Use affinity hint"

This reverts commit 70a640d0dae3a9b1b222ce673eb5d92c263ddd61.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 70a640d0 25-May-2014 Yuval Atias <yuvala@mellanox.com>

net/mlx4_en: Use affinity hint

The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it. To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores. If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1a91de28 07-May-2014 Joe Perches <joe@perches.com>

mellanox: Logging message cleanups

Use a more current logging style.

o Coalesce formats
o Add missing spaces for coalesced formats
o Align arguments for modified formats
o Add missing newlines for some logging messages
o Use DRV_NAME as part of format instead of %s, DRV_NAME to
reduce overall text.
o Use ..., ##__VA_ARGS__ instead of args... in macros
o Correct a few format typos
o Use a single line message where appropriate

Signed-off-by: Joe Perches <joe@perches.com>
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c98235cb 15-Apr-2014 Chris Mason <clm@fb.com>

mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll

The mlx4 driver is triggering schedules while atomic inside
mlx4_en_netpoll:

spin_lock_irqsave(&cq->lock, flags);
napi_synchronize(&cq->napi);
^^^^^ msleep here
mlx4_en_process_rx_cq(dev, cq, 0);
spin_unlock_irqrestore(&cq->lock, flags);

This was part of a patch by Alexander Guller from Mellanox in 2011,
but it still isn't upstream.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0276a330 19-Dec-2013 Eugenia Emantayev <eugenia@mellanox.com>

net/mlx4_en: Add NAPI support for transmit side

Add NAPI for TX side,
implement its support and provide NAPI callback.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 163561a4 06-Nov-2013 Eugenia Emantayev <eugenia@mellanox.com>

net/mlx4_en: Datapath structures are allocated per NUMA node

For each RX/TX ring and its CQ, allocation is done on a NUMA node that
corresponds to the core that the data structure should operate on.
The assumption is that the core number is reflected by the ring index.
The affected allocations are the ring/CQ data structures,
the TX/RX info and the shared HW/SW buffer.
For TX rings, each core has rings of all UPs.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 41d942d5 06-Nov-2013 Eugenia Emantayev <eugenia@mellanox.com>

net/mlx4_en: Datapath resources allocated dynamically

Currently all TX/RX rings and completion queues are part of the
netdev priv structure and are allocated statically. This patch
will change the priv to hold only arrays of pointers and therefore
all TX/RX rings and completetion queues will be allocated
dynamically. This is in preparation for NUMA aware allocations.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9e77a2b8 18-Jun-2013 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: Add Low Latency Socket (LLS) support

Add basic support for LLS.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ec693d47 23-Apr-2013 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: Add HW timestamping (TS) support

The patch allows to enable/disable HW timestamping for incoming and/or
outgoing packets. It adds and initializes all structs and callbacks
needed by kernel TS API.
To enable/disable HW timestamping appropriate ioctl should be used.
Currently HWTSTAMP_FILTER_ALL/NONE and HWTSAMP_TX_ON/OFF only are
supported.
When enabling TS on receive flow - VLAN stripping will be disabled.
Also were made all relevant changes in RX/TX flows to consider TS request
and plant HW timestamps into relevant structures.
mlx4_ib was fixed to compile with new mlx4_cq_alloc() signature.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 08ff3235 21-Oct-2012 Or Gerlitz <ogerlitz@mellanox.com>

mlx4: 64-byte CQE/EQE support

ConnectX-3 devices can use either 64- or 32-byte completion queue
entries (CQEs) and event queue entries (EQEs). Using 64-byte
EQEs/CQEs performs better because each entry is aligned to a complete
cacheline. This patch queries the HCA's capabilities, and if it
supports 64-byte CQEs and EQES the driver will configure the HW to
work in 64-byte mode.

The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.

Since this mode is global, userspace (libmlx4) must be updated to work
with the configured CQE size, and guests using SR-IOV virtual
functions need to know both EQE and CQE size.

In case one of the 64-byte CQE/EQE capabilities is activated, the
patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that
they need an update to be able to work with the PPF. This is done by
changing the returned pf_context_behaviour not to be zero any more. In
case none of these capabilities is activated that value remains zero
and older guest drivers can run OK.

The SRIOV related flow is as follows

1. the PPF does the detection of the new capabilities using
QUERY_DEV_CAP command.

2. the PPF activates the new capabilities using INIT_HCA.

3. the VF detects if the PPF activated the capabilities using
QUERY_HCA, and if this is the case activates them for itself too.

Note that the VF detects that it must be aware to the new PF behaviour
using QUERY_FUNC_CAP. Steps 1 and 2 apply also for native mode.

User space notification is done through a new field introduced in
struct mlx4_ib_ucontext which holds device capabilities for which user
space must take action. This changes the binary interface so the ABI
towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only
when **needed** i.e. only when the driver does use 64-byte CQEs or
future device capabilities which must be in sync by user space. This
practice allows to work with unmodified libmlx4 on older devices (e.g
A0, B0) which don't support 64-byte CQEs.

In order to keep existing systems functional when they update to a
newer kernel that contains these changes in VF and userspace ABI, a
module parameter enable_64b_cqe_eqe must be set to enable 64-byte
mode; the default is currently false.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>


# 1eb8c695 18-Jul-2012 Amir Vadai <amirv@mellanox.com>

net/mlx4_en: Add accelerated RFS support

Use RFS infrastructure and flow steering in HW to keep CPU
affinity of rx interrupts and application per TCP stream.

A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by core networking code.

Because the invocation takes place in interrupt context, the
actual setup of HW is done using workqueue. Whenever new filter
is added, the driver checks for expiry of existing filters.

Since there's window in time between the point where the core
RFS code invoked the ndo callback, to the point where the HW
is configured from the workqueue context, the 2nd, 3rd etc
packets from that stream will cause the net core to invoke
the callback again and again.

To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using hash function to enable
fast access.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d9236c3f 18-Jul-2012 Amir Vadai <amirv@mellanox.com>

{NET,IB}/mlx4: Add rmap support to mlx4_assign_eq

Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e22979d9 22-Apr-2012 Yevgeny Petrilin <yevgenyp@mellanox.co.il>

mlx4_en: Moving to Interrupts for TX completions

Moving to interrupts instead of polling fpr TX completions
Avoiding situations where skb can be held in by the driver for
a long time (till timer expires).
The change is also necessary for supporting BQL.

Removing comp_lock that was required because we could handle TX
completions from several contexts: Interrupts, timer, polling.
Now there is only interrupts

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cd3109d2 28-Dec-2011 Yevgeny Petrilin <yevgenyp@mellanox.co.il>

mlx4_en: nullify cq->vector field when closing completion queue

Caused loss of connectivity when changing ring size.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f0ab34f0 26-Nov-2011 Yevgeny Petrilin <yevgenyp@mellanox.co.il>

net/mlx4_en: using non collapsed CQ on TX

Moving to regular Completion Queue implementation (not collapsed)
Completion for each transmitted packet is written to new entry.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fe0af03c 08-Oct-2011 Alexander Guller <alexg@mellanox.com>

mlx4_en: Removing reserve vectors

Fixed a bug where ring size change caused insufficient memory
upon driver restart due to unreleased EQs.

Signed-off-by: Alexander Guller <alexg@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76532d0c 08-Oct-2011 Alexander Guller <alexg@mellanox.com>

mlx4_en: Assigning TX irq per ring

Until now only RX rings used irq per ring
and TX used only one per port.
>From now on, both of them will use the
irq per ring while RX & TX ring[i] will
use the same irq.

Signed-off-by: Alexander Guller <alexg@mellanox.co.il>
Signed-off-by: Sharon Cohen <sharonc@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a2cc190 13-May-2011 Jeff Kirsher <jeffrey.t.kirsher@intel.com>

mlx4: Move the Mellanox driver

Moves the Mellanox driver into drivers/net/ethernet/mellanox/ and
make the necessary Kconfig and Makefile changes.

CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>